Abstract

This paper deals with variable selection in the regression and binary classification frameworks. It proposes an automatic and exhaustive procedure which relies on the CART algorithm and on model selection via penalization. This work, of theoretical nature, aims at determining adequate penalties, i.e. penalties which yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be executed when the number of variables is too large, a more practical procedure is also proposed and still theoretically validated. A simulation study completes the theoretical results.

1. Introduction

This paper deals with variable selection in nonlinear regression and classification using CART estimation and a model selection approach. Our aim is to propose a theoretical variable selection procedure for nonlinear models and to consider some practical approaches.

Variable selection is a very important topic since we have to consider problems where the number of variables is very large while the number of variables that are really explanatory can be much smaller. This is the reason why we are interested in their importance. Variable importance is a notion which quantifies the ability of a variable to explain the studied phenomenon. The formula used to compute it depends on the considered model. In the literature, there are many variable selection procedures which combine primarily a concept of variable importance and model estimation. If we refer to the work of Kohavi and John (Kohavi and John [1997]) or Guyon and Elisseeff (Guyon and Elisseeff [2003]), these methods are "filter", "wrapper" or "embedded" methods. To summarize, (i) a filter method is a pre-processing step which does not depend on the learning algorithm, (ii) in a wrapper method the learning model is used to induce the final model but also to search for the optimal feature subset, and (iii) for embedded methods the feature selection and the learning part cannot be separated.

Let us mention some of those methods, in the regression and/or the classification framework.

1.1. General framework and State of the Art

Let us consider a linear regression model $Y = \sum_{j=1}^{p} \beta_j X^j + \varepsilon = X\beta + \varepsilon$, where $\varepsilon$ is an unobservable noise, $Y$ the response and $X = (X^1, \dots, X^p)$ a vector of $p$ explanatory variables. Let $\{(X_i, Y_i)\}_{1 \le i \le n}$ be a sample, i.e. $n$ independent copies of the pair of random variables $(X, Y)$.



The well-known Ordinary Least Squares (OLS) estimator provides a useful way to estimate the vector $\beta$ but it suffers from a main drawback: it is not adapted to variable selection since, when $p$ is large, many components of the estimate of $\beta$ are non zero. However, if OLS is not a convenient method to perform variable selection, the least squares criterion often appears in model selection. For example, Ridge Regression and Lasso (wrapper methods) are penalized versions of OLS. Ridge Regression (see Hastie et al. [2001]) involves an $L^2$ penalization which produces the shrinkage of $\beta$ but does not force any coefficient of $\beta$ to be zero. So Ridge Regression improves on OLS but, unlike Lasso, it is not a variable selection method. Lasso (see Tibshirani [1996]) uses the least squares criterion penalized by an $L^1$ penalty term. In this way, Lasso shrinks some coefficients and sets the others to zero. Thus, this last method performs variable selection but, computationally, its implementation needs quadratic programming techniques.
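The difference between the two penalties is easiest to see in the special case of an orthonormal design ($X^T X = I$), where both estimators have a closed form in terms of the OLS coefficients. The following Python sketch is only an illustration under that assumption; the function names, the scaling chosen for the two criteria and the numerical values are hypothetical and do not come from the cited works.

```python
import math


def ridge_orthonormal(beta_ols, lam):
    """Minimizer of 0.5*||y - Xb||^2 + 0.5*lam*||b||_2^2 when X^T X = I."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_orthonormal(beta_ols, lam):
    """Minimizer of 0.5*||y - Xb||^2 + lam*||b||_1 when X^T X = I (soft-thresholding)."""
    shrunk = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        # OLS coefficients smaller than lam in absolute value are set exactly to zero.
        shrunk.append(math.copysign(magnitude, b) if magnitude > 0.0 else 0.0)
    return shrunk


# Hypothetical OLS coefficients, for illustration only.
beta_ols = [3.0, -1.5, 0.4, -0.2]
ridge = ridge_orthonormal(beta_ols, lam=0.5)
lasso = lasso_orthonormal(beta_ols, lam=0.5)
print("ridge:", ridge)  # every coefficient is shrunk, none is exactly zero
print("lasso:", lasso)  # the two smallest coefficients are exactly zero
```

In this toy case the ridge solution keeps all four variables, whereas the Lasso solution discards the last two, which is the sense in which only the latter performs variable selection.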



Penalization is not the only way to perform variable or model selection. For example, we can cite Subset Selection (see Hastie et al. [2001]) which provides, for each $k \in \{1, \dots, p\}$, the best subset of size $k$, i.e. the subset of size $k$ which gives the smallest residual sum of squares. Then, by cross validation, the final subset is selected. This wrapper method is exhaustive: it is consequently difficult to use in practice when $p$ is large. Often, Forward or Backward Stepwise Selection (see Hastie et al. [2001]) are preferred since they are computationally efficient methods. However, they may eliminate useful predictors and, since they are not exhaustive, they may not reach the globally optimal model. In the regression framework, there exists an efficient algorithm developed by Furnival and Wilson (Furnival and Wilson [1974]) which achieves the optimal model, for a small number of explanatory variables, without exploring all the models.



More recently, the most promising method seems to be the method called Least Angle Regression (LARS) due to Efron et al. (Efron et al. [2004]). Let $\mu = x\beta$ where $x = (X_1^T, \dots, X_n^T)$. LARS builds an estimate of $\mu$ by successive steps. It proceeds by adding, at each step, one covariate to the model, as Forward Selection does. At the beginning, $\mu = \mu_0 = 0$. At the first step, LARS finds the predictor $X^{j_1}$ most correlated with the response $Y$ and increases $\mu_0$ in the direction of $X^{j_1}$ until another predictor $X^{j_2}$ has as large a correlation with the current residuals. Then, $\mu_0$ is replaced by $\mu_1$. This step corresponds to the first step of Forward Selection. But, unlike Forward Selection, LARS is based on an equiangular strategy. For example, at the second step, LARS proceeds equiangularly between $X^{j_1}$ and $X^{j_2}$ until another explanatory variable enters. This method is computationally efficient and gives good results in practice. However, a complete theoretical elucidation needs further investigation.

For linear regression, some works are also based on variable importance assessment; the aim is to produce a relative importance of the regressor variables. Grömping (Grömping [2007]) proposes a study of some estimators of relative importance based on variance decomposition.



In the context of nonlinear models, Sobol (Sobol [1993]) proposes an extension of the notion of relative importance via the Sobol sensitivity indices, which belong to sensitivity analysis (cf. Saltelli et al. (Saltelli et al. [2000])). The idea of variable importance is not so recent since it can be found in the book about Classification And Regression Trees of Breiman et al. (Breiman et al. [1984]), who introduce the variable importance as the decrease of node impurity measures, or in the studies about Random Forests by Breiman et al. (Breiman [2001], Breiman and Cutler [2005]), where the variable importance is rather a permutation importance index. Thanks to this notion, the variables can be ordered and we can easily deduce some filter or wrapper methods to select some of them. But there also exist some embedded approaches based on those notions or on some others. Thus, Díaz-Uriarte and Alvarez de Andrés (Díaz-Uriarte and de Andrés [2006]) propose the following recursive strategy. They compute the Random Forests variable importance and they delete the 20% of variables having the smallest importance; with the remaining variables, they construct a new forest and repeat the procedure. At the end, they compare all the forest models and keep the one having the smallest Out Of Bag error rate. Poggi and Tuleau (Poggi and Tuleau [2006]) develop a method based on CART and on a stepwise ascending strategy combined with an elimination step, while Genuer et al. (Genuer et al.) propose a procedure based on Random Forests combined with elimination, ranking and variable selection steps. Guyon et al. (Guyon et al. [2002]) propose a method of selection, called SVM-RFE, utilizing Support Vector Machine methods based on Recursive Feature Elimination. Recently, this approach has been modified by Ben Ishak et al. (Ghattas and Ishak [2008]) using a stepwise strategy.



1.2. Main goals 

In this paper, the purpose is to propose, for the regression and classification frameworks, a variable selection procedure, based on CART, which is adaptive and theoretically validated. This second point is very important and establishes a real difference with existing works, since most of the practical methods for both frameworks are currently not validated because of the use of Random Forests or of arbitrary thresholds on the variable importance. The method consists in applying the CART algorithm to each possible subset of variables and then considering model selection via penalization (cf. Birgé and Massart (Birgé and Massart [2007])), to select the set which minimizes a penalized criterion. In the regression and classification frameworks, we determine, via oracle bounds, the expressions of this penalized criterion.

More precisely, let $\mathcal{L} = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ be a sample, i.e. independent copies of a pair $(X, Y)$, where $X$ takes its values in $\mathcal{X}$, for example $\mathbb{R}^p$, with distribution $\mu$ and $Y$ belongs to $\mathcal{Y}$ ($\mathcal{Y} = \mathbb{R}$ in the regression framework and $\mathcal{Y} = \{0; 1\}$ in the classification one). Let $s$ be the regression function or the Bayes classifier according to the considered framework. We write $X = (X^1, \dots, X^p)$ where the $p$ variables $X^j$, with $j \in \{1, 2, \dots, p\}$, are the explanatory variables. We denote by $\Lambda$ the set of the $p$ explanatory variables, i.e. $\Lambda = \{X^1, X^2, \dots, X^p\}$. The explained variable $Y$ is called the response. When we deal with variable selection, there exist two different objectives (cf. Genuer et al.): the first one consists in determining all the important variables highly related to the response $Y$, whereas the second one is to find the smallest subset of variables that provides a good prediction of $Y$. Our purpose here is to find a subset $M$ of $\Lambda$, as small as possible, such that the variables in $M$ make it possible to predict the response $Y$.

To achieve this objective, we split the sample $\mathcal{L}$ in three subsamples $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$ of size $n_1$, $n_2$ and $n_3$ respectively. In the following, we consider two cases: the first one is "$\mathcal{L}_1$ independent of $\mathcal{L}_2$" and the second corresponds to "$\mathcal{L}_1 = \mathcal{L}_2$". Then we apply the CART algorithm to all the subsets of $\Lambda$ (an overview of CART is given later and, for more details, the reader can refer to Breiman et al. (Breiman et al. [1984])). More precisely, for any $M \in \mathcal{P}(\Lambda)$, we build the maximal tree by the CART growing procedure using the subsample $\mathcal{L}_1$. This tree, denoted $T_{\max}^{(M)}$, is constructed thanks to the class of admissible splits $Sp_M$ which involves only the variables of $M$. For any $M \in \mathcal{P}(\Lambda)$ and any $T \preceq T_{\max}^{(M)}$, we consider the space $S_{M,T}$ of $L^2(\mathbb{R}^p, \mu)$ composed of all the piecewise constant functions with values in $\mathcal{Y}$ and defined on the partition $\tilde{T}$ associated with the leaves of $T$. At this stage, we have the collection of models

$$\{S_{M,T},\ M \in \mathcal{P}(\Lambda) \text{ and } T \preceq T_{\max}^{(M)}\}$$

which depends only on $\mathcal{L}_1$. Then, for any $(M, T)$, we denote by $\hat{s}_{M,T}$ the $\mathcal{L}_2$ empirical risk minimizer on $S_{M,T}$:

$$\hat{s}_{M,T} = \operatorname{argmin}_{u \in S_{M,T}} \gamma_{n_2}(u) \quad \text{with} \quad \gamma_{n_2}(u) = \frac{1}{n_2} \sum_{(X_i, Y_i) \in \mathcal{L}_2} (Y_i - u(X_i))^2.$$

Finally, we select $(\hat{M}, \hat{T})$ by minimizing the penalized contrast:

$$(\hat{M}, \hat{T}) = \operatorname{argmin}_{(M,T)} \{\gamma_{n_2}(\hat{s}_{M,T}) + \mathrm{pen}(M, T)\}$$

and we denote the corresponding estimator $\tilde{s} = \hat{s}_{\hat{M},\hat{T}}$.

Our purpose is to determine the penalty function pen such that the model $(\hat{M}, \hat{T})$ is close to the optimal one. This means that the model selection procedure should satisfy an oracle inequality, i.e.:

$$E[l(s, \tilde{s}) \mid \mathcal{L}_1] \le C \inf_{(M,T)} \{E[l(s, \hat{s}_{M,T}) \mid \mathcal{L}_1]\}, \quad C \text{ close to } 1,$$

where $l$ denotes the loss function and $s$ the optimal predictor. The main results of this paper give adequate penalties defined up to two multiplicative constants $\alpha$ and $\beta$. Thus we have a family of estimators $\tilde{s}(\alpha, \beta)$ among which the final estimator is chosen using the test sample $\mathcal{L}_3$. This third subsample is admittedly introduced for practical purposes, but we also consider it in the theoretical part since we obtain some results on it.



The described procedure is, of course, a theoretical one since, when $p$ is too large, it may be impossible, in practice, to take into account all the $2^p$ sets of variables. A solution consists of first determining a few data-driven subsets of variables which are adapted to perform variable selection and then applying our procedure to those subsets. As this family of subsets, denoted $\mathcal{P}^*$, is constructed thanks to the data, the theoretical penalty, determined when the procedure involves the $2^p$ sets, is still adapted for the procedure restricted to $\mathcal{P}^*$ since this subset is not deterministic.
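To give an idea of the combinatorial cost of the exhaustive procedure, the following Python sketch enumerates the candidate sets of variables. It is only an illustration: the variable names are placeholders, and the empty set is left out here as an assumption of the example, so that $2^p - 1$ subsets are visited.

```python
from itertools import combinations


def all_variable_subsets(variables):
    """Yield every non-empty subset M of the variable set, by increasing size."""
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


# p = 10 placeholder variable names.
variables = [f"X{j}" for j in range(1, 11)]
n_subsets = sum(1 for _ in all_variable_subsets(variables))
print(n_subsets)  # one CART tree has to be grown and pruned for each of these sets
```

Since the count doubles with every additional variable, restricting the procedure to a small data-driven family of subsets is the only practical option once $p$ reaches a few dozen.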



The paper is organized as follows. After this introduction, Section 2 recalls the different steps of the CART algorithm and defines some notations. Sections 4 and 3 present the results obtained in the regression and classification frameworks, respectively. In both sections, the results have the same spirit; however, since the frameworks differ, the assumptions and the penalty functions are different. This is the reason why, for clarity, we divide our results. In Section 5, we apply our procedure to a simulated example and we compare the results of the procedure when, on the one hand, we consider all sets of variables and, on the other hand, we take into account only a subset determined thanks to the Variable Importance defined by Breiman et al. (Breiman et al. [1984]). Sections 6 and 7 collect lemmas and proofs.



2. Preliminaries 

2.1. Overview of CART and variable selection 

In the regression and classification frameworks and thanks to a training set, CART recursively splits the observation space $\mathcal{X}$ and defines a piecewise constant function on this partition, which is called a predictor or a classifier according to the case. CART proceeds in three steps: the construction of a maximal tree, the construction of nested models by pruning and a final model selection. In the following, we give a brief summary; for details, readers may refer to the seminal book of Breiman et al. (Breiman et al. [1984]) or to Gey's expository articles (Gey and Nedelec [2005], Gey).



The first step consists of the construction of a nested sequence of partitions of $\mathcal{X}$ using binary splits. A useful representation of this construction is a tree composed of nonterminal and terminal nodes. To each nonterminal node is associated a binary split which is just a question of the form $(X^j \le c_j)$ for numerical variables or $(X^j \in S_j)$ for qualitative ones. Such a split involves only one original explanatory variable and is determined by maximizing a quality criterion derived from an impurity function. For instance, in the regression framework the criterion for a node $t$ is the decrease of $R(t)$, where $R(t) = \frac{1}{n} \sum_{X_i \in t} (Y_i - \bar{Y}(t))^2$ with $\bar{Y}(t)$ the arithmetical mean of $Y$ over $t$. This is just the estimate of the error. In the classification framework, the criterion is the decrease in impurity, which is often the Gini index $i(t) = \sum_{i \ne j} p(i|t) p(j|t)$ with $p(i|t)$ the posterior probability of the class $i$ in $t$. In this case, the criterion is less intuitive, but the estimate of the misclassification rate has too many drawbacks to be used. The tree associated with the finest partition, that is to say the one whose elements contain one observation or observations with the same response, is the maximal tree. This tree is too complex and too faithful to the training sample. This is the reason for the next step.
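The two node criteria used in the growing step can be sketched as follows. This is a minimal Python illustration, not the CART implementation itself: the function names are hypothetical, and $p(i|t)$ is estimated here by the proportion of class $i$ among the observations of the node, which is an assumption of the example.

```python
def node_squared_error(responses, n_total):
    """R(t) = (1/n) * sum over the observations of node t of (Y_i - mean of Y over t)^2."""
    mean_t = sum(responses) / len(responses)
    return sum((y - mean_t) ** 2 for y in responses) / n_total


def squared_error_decrease(left, right, n_total):
    """Decrease of R obtained by splitting the node `left + right` into its two children."""
    parent = left + right
    return (node_squared_error(parent, n_total)
            - node_squared_error(left, n_total)
            - node_squared_error(right, n_total))


def gini_index(labels):
    """i(t) = sum_{i != j} p(i|t) p(j|t), computed as 1 - sum_i p(i|t)^2."""
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(q * q for q in proportions)


# Toy nodes, for illustration only.
decrease = squared_error_decrease([1.0, 1.0], [3.0, 3.0], n_total=4)
mixed = gini_index([0, 0, 1, 1])
pure = gini_index([1, 1, 1])
print(decrease, mixed, pure)
```

A split is good when the decrease is large; a node containing a single class has zero Gini impurity and cannot be improved by further splitting.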
The principle of the pruning step is to extract, from the maximal tree, a sequence of nested subtrees which minimize a penalized criterion. This penalized criterion, proposed by Breiman et al. (Breiman et al. [1984]), realizes a tradeoff between the goodness of fit and the complexity of the tree (or model) measured by the number of leaves.

Finally, via a test sample or cross validation, a subtree is selected among the preceding sequence.

CART is an algorithm which builds a binary decision tree. A first idea is to perform variable selection 
by retaining the variables appearing in the tree. This has many drawbacks since on the one hand, the 
number of selected variables may be too large, and on the other hand, some really important variables 
could be hidden by the selected ones. 



A second approach is based on the Variable Importance (VI) introduced by Breiman et al. (Breiman et al. [1984]). This criterion, calculated with respect to a given tree (typically coming from the CART procedure), quantifies the contribution of each variable by awarding it a score between 0 and 100. The variable selection consists of keeping the variables whose scores are greater than an arbitrary threshold. But there is, at present, no way to automatically determine the threshold, and such a method does not make it possible to remove influential variables that are highly dependent on one another.

In this paper, we propose another approach which consists of applying CART to each subset of variables 
and choosing the set which minimizes an adequate penalized criterion. 

2.2. The context 

The paper deals with two frameworks: the regression and the binary classification. In both cases, we denote

$$s = \operatorname{argmin}_{u : \mathcal{X} \to \mathcal{Y}} E[\gamma(u, (X, Y))] \quad \text{with} \quad \gamma(u, (x, y)) = (y - u(x))^2. \qquad (2.1)$$

The quantity $s$ represents the best predictor according to the quadratic contrast $\gamma$. Since the distribution $P$ is unknown, $s$ is unknown too. Thus, in the regression and classification frameworks, we use $(X_1, Y_1), \dots, (X_n, Y_n)$, independent copies of $(X, Y)$, to construct an estimator of $s$. The quality of this estimator is measured by the loss function $l$ defined by:

$$l(s, u) = E[\gamma(u, \cdot)] - E[\gamma(s, \cdot)]. \qquad (2.2)$$



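In the regression case, the expression of $s$ defined in (2.1) is

$$\forall x \in \mathbb{R}^p, \quad s(x) = E[Y \mid X = x],$$

and the loss function $l$ given by (2.2) is the squared $L^2(\mathbb{R}^p, \mu)$-norm, i.e. $l(s, u) = \|s - u\|_\mu^2$, where $\|\cdot\|_\mu$ denotes the $L^2(\mathbb{R}^p, \mu)$-norm.
In this context, each $(X_i, Y_i)$ satisfies

$$Y_i = s(X_i) + \varepsilon_i$$

where $(\varepsilon_1, \dots, \varepsilon_n)$ is a sample such that $E[\varepsilon_i \mid X_i] = 0$. In the following, we assume that the variables $\varepsilon_i$ have exponential moments around 0 conditionally to $X_i$. As explained in (Sauvé [2009]), this assumption can be expressed by the existence of two constants $\sigma \in \mathbb{R}_+^*$ and $\rho \in \mathbb{R}_+$ such that

$$\text{for any } \lambda \in (-1/\rho, 1/\rho), \quad \log E\left[e^{\lambda \varepsilon_i} \mid X_i\right] \le \frac{\sigma^2 \lambda^2}{2(1 - \rho|\lambda|)}. \qquad (2.3)$$

$\sigma^2$ is necessarily greater than $E(\varepsilon_i^2)$ and can be chosen as close to $E(\varepsilon_i^2)$ as we want, but at the price of a larger $\rho$.

Remark 1. If $\rho = 0$ in (2.3), the random variables $\varepsilon_i$ are said to be sub-Gaussian conditionally to $X_i$.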



In the classification case, the Bayes classifier $s$, given by (2.1), is defined by:

$$\forall x \in \mathbb{R}^p, \quad s(x) = 1_{\eta(x) \ge 1/2} \quad \text{with} \quad \eta(x) = E[Y \mid X = x].$$

As $Y$ and the predictors $u$ take their values in $\{0; 1\}$, we have $\gamma(u, (x, y)) = 1_{u(x) \ne y}$, so we deduce that the loss function $l$ can be expressed as:

$$l(s, u) = P(Y \ne u(X)) - P(Y \ne s(X)) = E\left[|s(X) - u(X)| \, |2\eta(X) - 1|\right].$$
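The identity between the excess misclassification risk and $E[|s(X) - u(X)| \, |2\eta(X) - 1|]$ can be checked numerically on a small discrete example. The Python sketch below is only such a check: the distribution of $X$, the values of $\eta$ and the classifier $u$ are hypothetical.

```python
def misclassification(u, px, eta):
    """P(Y != u(X)) for a classifier u on a finite space, with eta(x) = P(Y = 1 | X = x)."""
    return sum(px[x] * (eta[x] if u[x] == 0 else 1.0 - eta[x]) for x in px)


def excess_risk_formula(u, s, px, eta):
    """E[ |s(X) - u(X)| * |2*eta(X) - 1| ]."""
    return sum(px[x] * abs(s[x] - u[x]) * abs(2.0 * eta[x] - 1.0) for x in px)


# Hypothetical three-point distribution for X and regression function eta.
px = {0: 0.5, 1: 0.3, 2: 0.2}
eta = {0: 0.9, 1: 0.2, 2: 0.6}
s = {x: int(eta[x] >= 0.5) for x in px}  # Bayes classifier
u = {0: 1, 1: 1, 2: 0}                   # an arbitrary competitor

lhs = misclassification(u, px, eta) - misclassification(s, px, eta)
rhs = excess_risk_formula(u, s, px, eta)
print(lhs, rhs)  # the two quantities agree up to rounding
```

The competitor only pays on the points where it disagrees with the Bayes classifier, and the price at each such point is $|2\eta(x) - 1|$, which is why values of $\eta$ close to 1/2 make the comparison uninformative.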

For both frameworks, we consider two situations:

• (M1): the training sample $\mathcal{L}$ is divided in three independent parts $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$ of size $n_1$, $n_2$ and $n_3$ respectively. The subsample $\mathcal{L}_1$ is used to construct the maximal tree, $\mathcal{L}_2$ to prune it and $\mathcal{L}_3$ to perform the final selection;

• (M2): the training sample $\mathcal{L}$ is divided only in two independent parts $\mathcal{L}_1$ and $\mathcal{L}_3$. The first one is used both for the construction of the maximal tree and for its pruning, whereas the second one is used for the final selection.

The (M1) situation is theoretically easier since all the subsamples are independent, thus each step of the CART algorithm is performed on independent data sets. With real data, it is often difficult to split the sample in three parts because of the small number of observations. That is the reason why we also consider the more realistic situation (M2).
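The two sampling schemes can be sketched in Python as follows. The helper `split_sample`, the random shuffling and the subsample sizes are assumptions made for the illustration; the text only requires the parts to be independent.

```python
import random


def split_sample(sample, sizes, seed=0):
    """Shuffle `sample` and cut it into consecutive disjoint blocks of the given sizes."""
    if sum(sizes) != len(sample):
        raise ValueError("sizes must add up to the sample size")
    shuffled = list(sample)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for size in sizes:
        parts.append(shuffled[start:start + size])
        start += size
    return parts


# Toy sample of (x, y) pairs, for illustration only.
sample = [(i, i % 2) for i in range(12)]

# (M1): L1 grows the maximal trees, L2 prunes them, L3 performs the final selection.
l1, l2, l3 = split_sample(sample, sizes=(6, 3, 3))

# (M2): L1 is used for both growing and pruning, L3 for the final selection.
l1_m2, l3_m2 = split_sample(sample, sizes=(9, 3))
```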

3. Classification 

This section deals with the binary classification framework. In this context, we know that the best predictor is the Bayes classifier $s$ defined by:

$$\forall x \in \mathbb{R}^p, \quad s(x) = 1_{\eta(x) \ge 1/2}.$$

A problem appears when $\eta(x)$ is close to 1/2, because in this case the choice between the labels 0 and 1 is difficult. If $P(\eta(X) = 1/2) \ne 0$, then the accuracy of the Bayes classifier is not really good and the comparison with $s$ is not relevant. For this reason, we consider the margin condition introduced by Tsybakov (Tsybakov [2004]):

$$\exists h > 0 \text{ such that } \forall x \in \mathbb{R}^p, \quad |2\eta(x) - 1| \ge h.$$

For details about this margin condition, we refer to Massart (Massart [2003]). In addition, (Arlot and Bartlett) contains some considerations about margin-adaptive model selection, more precisely in the case of nested models and with the margin condition introduced by Mammen and Tsybakov (Mammen and Tsybakov [1999]).



The following subsection gives results on the variable selection for the methods (M1) and (M2) under the margin condition. More precisely, we define convenient penalty functions which lead to oracle bounds. The last subsection deals with the final selection by the test sample $\mathcal{L}_3$.

3.1. Variable selection via (M1) and (M2)

• (M1) case:
Given the collection of models

$$\{S_{M,T},\ M \in \mathcal{P}(\Lambda) \text{ and } T \preceq T_{\max}^{(M)}\}$$

built on $\mathcal{L}_1$, we use the second subsample $\mathcal{L}_2$ to select a model $(\hat{M}, \hat{T})$ which is close to the optimal one. To do this, we minimize a penalized criterion

$$\mathrm{crit}(M, T) = \gamma_{n_2}(\hat{s}_{M,T}) + \mathrm{pen}(M, T).$$

The following proposition gives a penalty function pen for which the risk of the penalized estimator $\tilde{s} = \hat{s}_{\hat{M},\hat{T}}$ can be compared to the oracle accuracy.

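Proposition 1. Let us consider a penalty function of the form: $\forall M \in \mathcal{P}(\Lambda)$ and $\forall T \preceq T_{\max}^{(M)}$,

$$\mathrm{pen}(M, T) = \alpha \frac{|\tilde{T}|}{n_2 h} + \beta \frac{|M|}{n_2 h} \left(1 + \log\left(\frac{p}{|M|}\right)\right).$$

If $\alpha > \alpha_0$ and $\beta > \beta_0$, then there exist two positive constants $C_1 > 1$ and $C_2$, which only depend on $\alpha$ and $\beta$, such that:

$$E[l(s, \tilde{s}) \mid \mathcal{L}_1] \le C_1 \inf_{(M,T)} \{l(s, S_{M,T}) + \mathrm{pen}(M, T)\} + C_2 \frac{1}{n_2 h}$$

where $l(s, S_{M,T}) = \inf_{u \in S_{M,T}} l(s, u)$.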



The penalty is the sum of two terms. The first one is proportional to $|\tilde{T}|/n_2$, where $|\tilde{T}|$ is the number of leaves of $T$, and corresponds to the penalty proposed by Breiman et al. (Breiman et al. [1984]) in their pruning algorithm. The other one is proportional to $\frac{|M|}{n_2}\left(1 + \log\left(\frac{p}{|M|}\right)\right)$ and is due to the variable selection. It penalizes models that are based on too many explanatory variables. For a given value of $|M|$, this result validates the CART pruning algorithm in the binary classification framework, a result also proved by Gey (Gey) in a more general situation since the author considers a weaker margin condition.
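To make the roles of the two terms concrete, the penalty of Proposition 1 can be evaluated as in the following Python sketch. The function name and all numerical values (in particular $\alpha$, $\beta$ and $h$) are hypothetical; the theory only states that suitable constants exist and does not calibrate them.

```python
import math


def pen_classification_m1(n_leaves, n_vars, p, n2, h, alpha, beta):
    """alpha*|T~|/(n2*h) + beta*|M|/(n2*h) * (1 + log(p/|M|))."""
    tree_term = alpha * n_leaves / (n2 * h)
    variable_term = beta * n_vars / (n2 * h) * (1.0 + math.log(p / n_vars))
    return tree_term + variable_term


# Hypothetical setting: p = 10 variables, n2 = 100 observations, margin h = 0.5.
small_model = pen_classification_m1(n_leaves=5, n_vars=2, p=10, n2=100, h=0.5,
                                    alpha=2.0, beta=1.0)
full_model = pen_classification_m1(n_leaves=5, n_vars=10, p=10, n2=100, h=0.5,
                                   alpha=2.0, beta=1.0)
print(small_model, full_model)  # same tree size, but the full model is penalized more
```

For a fixed number of leaves, the second term grows with $|M|$, so two trees of equal size are ranked according to the number of variables they are allowed to use.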

Thanks to this penalty function, the problem can be divided in practice in two steps:

- First, for every set of variables $M$, we select a subtree $\hat{T}_M$ of $T_{\max}^{(M)}$ by

$$\hat{T}_M = \operatorname{argmin}_{T \preceq T_{\max}^{(M)}} \left\{\gamma_{n_2}(\hat{s}_{M,T}) + \alpha \frac{|\tilde{T}|}{n_2 h}\right\}.$$

This means that $\hat{T}_M$ is a tree obtained by the CART pruning procedure using the subsample $\mathcal{L}_2$.

- Then we choose a set $\hat{M}$ by minimizing a criterion which penalizes the big sets of variables:

$$\hat{M} = \operatorname{argmin}_{M \in \mathcal{P}(\Lambda)} \left\{\gamma_{n_2}(\hat{s}_{M,\hat{T}_M}) + \mathrm{pen}(M, \hat{T}_M)\right\}.$$
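The two steps above can be sketched in Python as follows, assuming that the pruned subtrees of each maximal tree and their empirical risks on the second subsample have already been computed. The data structure `candidates`, the function name and every numerical value are hypothetical and serve only to illustrate the mechanics.

```python
import math


def select_model(candidates, p, n2, h, alpha, beta):
    """Return (criterion, variables, n_leaves) minimizing the penalized criterion.

    `candidates` maps each variable set M (a tuple of names) to a list of
    (number of leaves, empirical risk) pairs, one per pruned subtree.
    """
    best = None
    for variables, subtrees in candidates.items():
        # Step 1: for a fixed M, only the tree-size term of the penalty matters.
        n_leaves, risk = min(subtrees, key=lambda t: t[1] + alpha * t[0] / (n2 * h))
        # Step 2: add the term that penalizes large sets of variables.
        m = len(variables)
        criterion = (risk + alpha * n_leaves / (n2 * h)
                     + beta * m / (n2 * h) * (1.0 + math.log(p / m)))
        if best is None or criterion < best[0]:
            best = (criterion, variables, n_leaves)
    return best


# Hypothetical risks for three nested sets of variables (p = 3).
candidates = {
    ("X1",): [(1, 0.40), (2, 0.25), (4, 0.24)],
    ("X1", "X2"): [(1, 0.40), (3, 0.10), (6, 0.09)],
    ("X1", "X2", "X3"): [(1, 0.40), (3, 0.10), (7, 0.09)],
}
best = select_model(candidates, p=3, n2=100, h=0.5, alpha=1.0, beta=1.0)
print(best)
```

In this toy example the third variable does not reduce the empirical risk, so the extra penalty it incurs makes the two-variable model the minimizer.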

The (M1) situation makes it possible to work conditionally to the construction of the maximal trees $T_{\max}^{(M)}$ and to select a model among a deterministic collection. Finding a convenient penalty to select a model among a deterministic collection is easier, but we do not always have enough observations to split the training sample $\mathcal{L}$ in three subsamples. This is the reason why we now study the (M2) situation.

• (M2) case:

We manage to extend our result to the case of only one subsample $\mathcal{L}_1$. But, while in the (M1) method we work with the expected loss, here we need the expected loss conditionally to $\{X_i, (X_i, Y_i) \in \mathcal{L}_1\}$ defined by:

$$l_1(s, u) = P(u(X) \ne Y \mid \{X_i, (X_i, Y_i) \in \mathcal{L}_1\}) - P(s(X) \ne Y \mid \{X_i, (X_i, Y_i) \in \mathcal{L}_1\}). \qquad (3.1)$$

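Proposition 2. Let us consider a penalty function of the form: $\forall M \in \mathcal{P}(\Lambda)$ and $\forall T \preceq T_{\max}^{(M)}$,

$$\mathrm{pen}(M, T) = \alpha \left[1 + (|M| + 1)\left(1 + \log\left(\frac{n_1}{|M| + 1}\right)\right)\right] \frac{|\tilde{T}|}{n_1 h} + \beta \frac{|M|}{n_1 h} \left(1 + \log\left(\frac{p}{|M|}\right)\right).$$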

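If $\alpha > \alpha_0$ and $\beta > \beta_0$, then there exist three positive constants $C_1 > 2$, $C_2$ and $\Sigma$, which only depend on $\alpha$ and $\beta$, such that, for any $\xi > 0$, with probability $\ge 1 - e^{-\xi}\Sigma$:

$$l_1(s, \tilde{s}) \le C_1 \inf_{(M,T)} \{l_1(s, S_{M,T}) + \mathrm{pen}(M, T)\} + C_2 \frac{1 + \xi}{n_1 h}$$

where $l_1(s, S_{M,T}) = \inf_{u \in S_{M,T}} l_1(s, u)$.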



When we consider the (M2) situation instead of the (M1) one, we obtain only an inequality with high probability instead of a result in expectation. Indeed, since all the results are obtained conditionally to the construction of the maximal tree, in this second situation it is impossible to integrate with respect to $\mathcal{L}_1$, whereas in the first situation we integrated with respect to $\mathcal{L}_2$.

Since the penalized criterion depends on two parameters $\alpha$ and $\beta$, we obtain a family of predictors $\tilde{s} = \hat{s}_{\hat{M},\hat{T}}$ indexed by $\alpha$ and $\beta$, and the associated family of sets of variables $\hat{M}$. Now, we choose the final predictor using the test sample and we deduce the corresponding set of selected variables.

3.2. Final selection 

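Now, we have a collection of predictors

$$\mathcal{G} = \{\tilde{s}(\alpha, \beta);\ \alpha > \alpha_0 \text{ and } \beta > \beta_0\}$$

which depends on $\mathcal{L}_1$ and $\mathcal{L}_2$.
For any $M$ of $\mathcal{P}(\Lambda)$, the set $\{T \preceq T_{\max}^{(M)}\}$ is finite. As $\mathcal{P}(\Lambda)$ is finite too, the cardinal $\mathcal{K}$ of $\mathcal{G}$ is finite and

$$\mathcal{K} \le \sum_{M \in \mathcal{P}(\Lambda)} \mathcal{K}_M$$

where $\mathcal{K}_M$ is the number of subtrees of $T_{\max}^{(M)}$ obtained by the pruning algorithm defined by Breiman et al. (Breiman et al. [1984]). $\mathcal{K}_M$ is much smaller than $|\{T \preceq T_{\max}^{(M)}\}|$. Given the subsample $\mathcal{L}_3$, we choose the final estimator $\tilde{\tilde{s}}$ by minimizing the empirical contrast $\gamma_{n_3}$ on $\mathcal{G}$:

$$\tilde{\tilde{s}} = \operatorname{argmin}_{\tilde{s}(\alpha,\beta) \in \mathcal{G}} \gamma_{n_3}(\tilde{s}(\alpha, \beta)).$$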
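The final selection step amounts to a plain minimization of the empirical contrast over a finite collection, as in the following Python sketch. The two candidate predictors, the keys $(\alpha, \beta)$ attached to them and the test sample are hypothetical.

```python
def empirical_contrast(predictor, sample):
    """gamma_n(u) = (1/n) * sum over the sample of (Y_i - u(X_i))^2."""
    return sum((y - predictor(x)) ** 2 for x, y in sample) / len(sample)


def final_selection(predictors, test_sample):
    """Return the key of the predictor with the smallest empirical contrast on the test sample."""
    return min(predictors, key=lambda k: empirical_contrast(predictors[k], test_sample))


# Hypothetical family indexed by (alpha, beta): a one-split classifier and a constant one.
predictors = {
    (1.0, 1.0): lambda x: int(x[0] > 0),
    (5.0, 5.0): lambda x: 0,
}
test_sample = [((1,), 1), ((-1,), 0), ((2,), 1), ((-3,), 0)]
chosen = final_selection(predictors, test_sample)
print(chosen)
```

Although $\alpha$ and $\beta$ range over a continuum, only finitely many distinct predictors arise, so the minimization is a search over a finite list.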

The next result validates the final selection for the (M1) method.
Proposition 3. For any $\eta \in (0, 1)$, we have:



E 



l(sj) Xal < -^-mfil(s,s(a,/3)) 

v ' J 1 — 77 («,/?) I 



773/7 



log7C + 



2n+i + I 

n-jh 



For the (M2) method, we get exactly the same result except that the loss $l$ is replaced by the conditional loss $l_1$ (3.1).



For the (M1) method, since the results in expectation of Propositions 1 and 3 involve the same expected loss, we can compare the final estimator $\tilde{\tilde{s}}$ with the entire collection of models:



3 



< Ci inf Ii(s,Smt) + pen(M,T)) + ■% + + logTf) 

(M,r) I J 772« mh\ ) 



In the classification framework, it may be possible to obtain sharper upper bounds by considering, for instance, the version of Talagrand's concentration inequality developed by Rio (Rio [2002]), or another margin condition such as the one proposed by Koltchinskii (see Koltchinskii [2004]) and used by Gey (Gey). However, the idea remains the same and those improvements are of limited interest here, since our work does not provide a precise calibration of the constants.



4. Regression 

Let us consider the regression framework where the $\varepsilon_i$ are supposed to have exponential moments around 0 conditionally to $X_i$ (cf. (2.3)).

In this section, we add a stop-splitting rule in the CART growing procedure. During the construction of the maximal trees $T_{\max}^{(M)}$, $M \in \mathcal{P}(\Lambda)$, a node is split only if the two resulting nodes contain at least $N_{\min}$ observations.



As in the classification section, the following subsection gives results on the variable selection for the methods (M1) and (M2) and the last subsection deals with the final selection by the test sample $\mathcal{L}_3$.

4.1. Variable selection via (M1) and (M2)

In this subsection, we show that for convenient constants $\alpha$ and $\beta$, the same form of penalty function as in the classification framework leads to an oracle bound.

• (M1) case:



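Proposition 4. Let us suppose that $\|s\|_\infty \le R$, with $R$ a positive constant.
Let us consider a penalty function of the form: $\forall M \in \mathcal{P}(\Lambda)$ and $\forall T \preceq T_{\max}^{(M)}$,

$$\mathrm{pen}(M, T) = \alpha (\sigma^2 + \rho R) \frac{|\tilde{T}|}{n_2} + \beta (\sigma^2 + \rho R) \frac{|M|}{n_2} \left(1 + \log\left(\frac{p}{|M|}\right)\right).$$

If $p \le \log n_2$, $N_{\min} \ge 24 \frac{\rho^2}{\sigma^2} \log n_2$, $\alpha > \alpha_0$ and $\beta > \beta_0$,
then there exist two positive constants $C_1 > 2$ and $C_2$, which only depend on $\alpha$ and $\beta$, such that:

$$E\left[\|s - \tilde{s}\|_{n_2}^2 \mid \mathcal{L}_1\right] \le C_1 \inf_{(M,T)} \left\{\inf_{u \in S_{M,T}} \|s - u\|_\mu^2 + \mathrm{pen}(M, T)\right\} + C_2 \frac{(\sigma^2 + \rho R)}{n_2} + C(\rho, \sigma, R) \frac{1_{\rho \ne 0}}{n_2 \log n_2}$$

where $\|\cdot\|_{n_2}$ denotes the empirical norm on $\{X_i; (X_i, Y_i) \in \mathcal{L}_2\}$ and $C(\rho, \sigma, R)$ is a constant which only depends on $\rho$, $\sigma$ and $R$.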

As in classification, the penalty function is the sum of two terms: one is proportional to $\frac{|\tilde{T}|}{n_2}$ and the other to $\frac{|M|}{n_2}\left(1 + \log\left(\frac{p}{|M|}\right)\right)$. The first term also corresponds to the penalty proposed by Breiman et al. (Breiman et al. [1984]) in their pruning algorithm and validated by Gey and Nedelec (Gey and Nedelec [2005]) for the Gaussian regression case. This proposition validates the CART pruning penalty in a more general regression framework than the Gaussian one.



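Remark 2. In practice, since $\sigma$, $\rho$ and $R$ are unknown, we consider penalties of the form

$$\mathrm{pen}(M, T) = \alpha' \frac{|\tilde{T}|}{n_2} + \beta' \frac{|M|}{n_2} \left(1 + \log\left(\frac{p}{|M|}\right)\right).$$

If $\rho = 0$, the form of the penalty is

$$\mathrm{pen}(M, T) = \alpha \sigma^2 \frac{|\tilde{T}|}{n_2} + \beta \sigma^2 \frac{|M|}{n_2} \left(1 + \log\left(\frac{p}{|M|}\right)\right),$$

the oracle bound becomes

$$E\left[\|s - \tilde{s}\|_{n_2}^2 \mid \mathcal{L}_1\right] \le C_1 \inf_{(M,T)} \left\{\inf_{u \in S_{M,T}} \|s - u\|_\mu^2 + \mathrm{pen}(M, T)\right\} + C_2 \frac{\sigma^2}{n_2},$$

and the assumptions on $\|s\|_\infty$, $p$ and $N_{\min}$ are no longer required. Moreover, the constants $\alpha_0$ and $\beta_0$ can be taken as follows:

$$\alpha_0 = 2(1 + 3 \log 2) \quad \text{and} \quad \beta_0 = 3.$$

In this case $\sigma^2$ is the single unknown parameter which appears in the penalty. Instead of using $\alpha'$ and $\beta'$ as proposed above, we can in practice replace $\sigma^2$ by an estimator.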
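In the sub-Gaussian case $\rho = 0$, the penalty of Remark 2 is fully explicit once $\sigma^2$ is replaced by an estimate. The following Python sketch evaluates it; the function name, the choice of $\alpha$ and $\beta$ slightly above the thresholds and the value used for the variance estimate are assumptions of the example, since the text does not prescribe a particular estimator.

```python
import math

ALPHA_0 = 2.0 * (1.0 + 3.0 * math.log(2.0))  # threshold on alpha when rho = 0
BETA_0 = 3.0                                 # threshold on beta when rho = 0


def pen_regression_subgaussian(n_leaves, n_vars, p, n2, sigma2, alpha, beta):
    """alpha*sigma^2*|T~|/n2 + beta*sigma^2*|M|/n2 * (1 + log(p/|M|))."""
    if alpha <= ALPHA_0 or beta <= BETA_0:
        raise ValueError("the oracle bound requires alpha > alpha_0 and beta > beta_0")
    tree_term = alpha * sigma2 * n_leaves / n2
    variable_term = beta * sigma2 * n_vars / n2 * (1.0 + math.log(p / n_vars))
    return tree_term + variable_term


# Hypothetical values: sigma2 stands for any estimate of the noise variance.
alpha, beta = 1.1 * ALPHA_0, 1.1 * BETA_0
penalty = pen_regression_subgaussian(n_leaves=8, n_vars=4, p=10, n2=200,
                                     sigma2=2.0, alpha=alpha, beta=beta)
print(ALPHA_0, penalty)
```

The guard on the constants reflects the conditions under which the bound of the remark is stated; smaller constants may still behave well in practice, but they are not covered by the result.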

• (M2) case:

In this situation, the same subsample $\mathcal{L}_1$ is used to build the collection of models

$$\{S_{M,T},\ M \in \mathcal{P}(\Lambda) \text{ and } T \preceq T_{\max}^{(M)}\}$$

and to select one of them.

For technical reasons, we introduce the collection of models

$$\{S_{M,T},\ M \in \mathcal{P}(\Lambda) \text{ and } T \in \mathcal{M}_{n_1,M}\}$$

where $\mathcal{M}_{n_1,M}$ is the set of trees built on the grid $\{X_i; (X_i, Y_i) \in \mathcal{L}_1\}$ with splits on the variables in $M$. This collection contains the preceding one and only depends on $\{X_i; (X_i, Y_i) \in \mathcal{L}_1\}$. We find nearly the same result as in the (M1) situation.



Proposition 5. Let us suppose that $\|s\|_\infty \le R$, with $R$ a positive constant.
Let us consider a penalty function of the form: $\forall M \in \mathcal{P}(\Lambda)$ and $\forall T \preceq T_{\max}^{(M)}$,

. ff p( 1+ ^^(i)) +P «)( 1+(l « l+1 ,( 1+ ,o g ( Pi i_)))m 

If p < log«i, a > ckq and/3 > /?o, 

then there exists three positive constants C\ > 2, C 2 andU which only depend on a and (3, such that: 
V£ > 0, with probability > 1 - e^Z - ^-^ H P *o, 

\\s - sf m < Ci inf f inf \\s - < + pen(M, T) 

(M,T) \ ueS M ,T 



where $\|\cdot\|_{n_1}$ denotes the empirical norm on $\{X_i; (X_i, Y_i) \in \mathcal{L}_1\}$ and $c$ is a constant which depends on $\rho$ and $\sigma$.

As in the (M1) case, for a given $|M|$, we find a penalty proportional to $\frac{|\tilde{T}|}{n_1}$, as proposed by Breiman et al. and validated by Gey and Nedelec in the Gaussian regression framework. So here again, we validate the CART pruning penalty in a more general regression framework.

Unlike the (M1) case, the multiplicative factor of $\frac{|\tilde{T}|}{n_1}$ in the penalty function depends on $M$ and $n_1$. Moreover, in the method (M2), the inequality is obtained only with high probability.

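Remark 3. If $\rho = 0$, the form of the penalty is

$$\mathrm{pen}(M, T) = \alpha \sigma^2 \left[1 + (|M| + 1)\left(1 + \log\left(\frac{n_1}{|M| + 1}\right)\right)\right] \frac{|\tilde{T}|}{n_1} + \beta \sigma^2 \frac{|M|}{n_1} \left(1 + \log\left(\frac{p}{|M|}\right)\right),$$

the oracle bound is: $\forall \xi > 0$, with probability $\ge 1 - e^{-\xi}\Sigma$,

$$\|s - \tilde{s}\|_{n_1}^2 \le C_1 \inf_{(M,T)} \left\{\inf_{u \in S_{M,T}} \|s - u\|_{n_1}^2 + \mathrm{pen}(M, T)\right\} + C_2 \frac{\sigma^2}{n_1} \xi$$

and the assumptions on $\|s\|_\infty$ and $p$ are no longer required. Moreover, we see that we can take $\alpha_0 = \beta_0 = 3$.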

4.2. Final selection 

The next result validates this selection. 

Proposition 6. • In the (Ml) situation, taking p < log«2 and N m i„ > A a ^ R log«2, we have: 
for any % > 0, with probability > 1 - e" f - # P #o 2 (J'+ p r) 7^p> V/ 7 6 (°> *)» 



~i|2 

s — sf 



(l + iy-'- ? ) . nf n^^^Ha 

1/2 2 \ (21og7C + g) 
rf- \ 1 - 77 / « 3 
• In the (M2) situation, denoting e(«i) = 27/ p ^o"i exp ^— ^Jjf^f iog Bl ) j> w have: 
for any £ > 0, wz'z7z probability > 1 — e - ^ - e(«i), V77 e (0, 1), 



HE < (1+ "' 2 1 " ?7) inf \\s-s(aMt 

1 / 2 2 „ ,„ 2 , \ (21og7C + £) 
+ — o- 2 + 4pR + 12p 2 log Wl - 2 

»r \ 1 - v I «3 

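Remark 4. If $\rho = 0$, by integrating with respect to $\xi$ we get for the two methods (M1) and (M2) that:
for any $\eta \in (0,1)$,

$$\mathbb{E}\left[\|s-\tilde{\tilde{s}}\|^2_{n_3}\ \big|\ \mathcal{L}_1,\mathcal{L}_2\right] \le \frac{1+\eta^{-1}-\eta}{\eta^2}\inf_{(\alpha,\beta)\in\mathcal{G}}\mathbb{E}\left[\|s-\tilde{s}(\alpha,\beta)\|^2_{n_3}\ \big|\ \mathcal{L}_1,\mathcal{L}_2\right] + \frac{2}{\eta^2(1-\eta)}\frac{\sigma^2}{n_3}(2\log K+1).$$

The conditional risk of the final estimator $\tilde{\tilde{s}}$ with respect to $\|\cdot\|_{n_3}$ is controlled by the minimum of the errors made by the estimators $\tilde{s}(\alpha,\beta)$. Thus the test sample selection does not alter the accuracy of the final estimator much. We can now conclude that, theoretically, our procedure is valid.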
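
The integration step used in Remark 4 can be spelled out as follows. This is a sketch in which $Z$, $A$ and $B$ are shorthand introduced here for illustration: $Z$ stands for the loss of the final estimator, $A$ for the part of the bound that does not involve $\xi$, and $B$ for the deterministic factor in front of $\xi$.

```latex
% Z, A and B are illustrative shorthand, not notation from the paper.
\begin{align*}
&\text{If } \mathbb{P}\left(Z > A + B\xi\right) \le e^{-\xi} \text{ for all } \xi > 0,
  \text{ with } B > 0 \text{ deterministic, then}\\
&\mathbb{E}\left[(Z - A)_+\right]
  = B\int_0^{\infty} \mathbb{P}\left(Z - A > B\xi\right)\,d\xi
  \le B\int_0^{\infty} e^{-\xi}\,d\xi = B,\\
&\text{so that } \mathbb{E}[Z] \le \mathbb{E}[A] + B .
\end{align*}
```

Applied with $B = \frac{2\sigma^2}{\eta^2(1-\eta)n_3}$, this is where the term $\xi$ of the high-probability bound becomes the constant $1$ in $(2\log K + 1)$.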

Unlike in the classification framework, we are not able, even when $\rho = 0$, to compare the final estimator $\tilde{\tilde{s}}$ with the entire collection of models, since the different inequalities involve empirical norms that cannot be compared.
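
The final selection step can be sketched as follows, assuming it amounts to choosing, among the candidate estimators indexed by $(\alpha,\beta)$, the one with the smallest empirical quadratic error on the test sample. The function name, the dictionary representation of the candidates and the toy predictors are assumptions made for illustration only.

```python
def select_by_test_sample(candidates, test_sample):
    """Return the (alpha, beta) key whose predictor has the smallest
    mean squared error on the test sample.

    candidates: dict mapping (alpha, beta) -> callable x -> prediction
    test_sample: list of (x, y) pairs (the role played by the third sample)
    """
    def empirical_risk(predict):
        return sum((y - predict(x)) ** 2 for x, y in test_sample) / len(test_sample)

    return min(candidates, key=lambda key: empirical_risk(candidates[key]))
```

Since the collection of candidates is finite (of cardinality $K$ in the text), the minimum always exists; ties are broken by dictionary order in this sketch.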

5. Simulations 

The aim of this section is twofold. On the one hand, we illustrate the theoretical procedure described above with an example. On the other hand, we compare the results of the theoretical procedure with those obtained when the procedure is restricted to a family $\mathcal{P}^*$ constructed thanks to Breiman's Variable Importance.



The simulated example, also used by Breiman et al. (see Breiman et al., 1984, p. 237), is composed of $p = 10$ explanatory variables $X^1, \ldots, X^{10}$ such that:

$$\mathbb{P}(X^1 = -1) = \mathbb{P}(X^1 = 1) = \frac{1}{2}$$

$$\forall i \in \{2,\ldots,10\},\quad \mathbb{P}(X^i = -1) = \mathbb{P}(X^i = 0) = \mathbb{P}(X^i = 1) = \frac{1}{3}$$

and of the explained variable $Y$ given by:

$$Y = s(X^1,\ldots,X^{10}) + \varepsilon = \begin{cases} 3 + 3X^2 + 2X^3 + X^4 + \varepsilon & \text{if } X^1 = 1,\\ -3 + 3X^5 + 2X^6 + X^7 + \varepsilon & \text{if } X^1 = -1,\end{cases}$$

where the unobservable random variable $\varepsilon$ is independent of $X^1,\ldots,X^{10}$ and normally distributed with mean 0 and variance 2.

The variables $X^8$, $X^9$ and $X^{10}$ do not appear in the definition of the explained variable $Y$; they can be considered as observable noise.
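
A sample from this model can be drawn with a few lines of code. The sketch below assumes the reading of the example given by Breiman et al.: $X^1$ uniform on $\{-1,1\}$, the other variables uniform on $\{-1,0,1\}$, and centred Gaussian noise of variance 2. The function names and the use of a seed are illustrative choices.

```python
import random


def draw_observation(rng):
    """Draw one pair (x, y); x[0] plays the role of X^1, ..., x[9] of X^10."""
    x = [rng.choice((-1, 1))] + [rng.choice((-1, 0, 1)) for _ in range(9)]
    noise = rng.gauss(0.0, 2 ** 0.5)  # standard deviation sqrt(2), i.e. variance 2
    if x[0] == 1:
        signal = 3 + 3 * x[1] + 2 * x[2] + x[3]
    else:
        signal = -3 + 3 * x[4] + 2 * x[5] + x[6]
    return x, signal + noise


def draw_sample(n, seed=0):
    """Draw n independent copies of (X, Y) with a reproducible generator."""
    rng = random.Random(seed)
    return [draw_observation(rng) for _ in range(n)]
```

Calling `draw_sample(1000)` gives a training sample of the size used in this section; the last three coordinates of each `x` never enter the signal.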
Table 1 contains Breiman's Variable Importance. The first row presents the explanatory variables ordered from the most influential to the least influential, whereas the second one contains Breiman's Variable Importance Ranking.

Variable   $X^1$  $X^2$  $X^5$  $X^3$  $X^6$  $X^4$  $X^7$  $X^8$  $X^9$  $X^{10}$
Rank       1      2      3      5      4      7      6      8      9      10

Table 1: Variable Importance Ranking for the considered simulated example.



We note that the Variable Importance Ranking is consistent with the simulated model since the two orders coincide. Indeed, in the model, the variables $X^3$ and $X^6$ (respectively $X^4$ and $X^7$) have the same effect on the response variable $Y$.

To apply our procedure, we consider a training sample $\mathcal{L}$ which consists of the realization of 1000 independent copies of the pair of random variables $(X, Y)$ where $X = (X^1, \ldots, X^{10})$.

The first results are related to the behaviour of the set of variables associated with the estimator $\tilde{s}$. More precisely, for given values of the parameters $\alpha$ and $\beta$ of the penalty function, we look at the selected set of variables.

According to the model definition and the Variable Importance Ranking, the expected results are the following:

• the size of the selected set should belong to $\{1, 3, 5, 7, 10\}$. As the variables $X^2$ and $X^5$ (respectively $X^3$ and $X^6$, $X^4$ and $X^7$, or $X^8$, $X^9$ and $X^{10}$) have the same effect on the response variable, the other sizes should not appear, theoretically;

• the set of size $k$, $k \in \{1, 3, 5, 7, 10\}$, should contain the $k$ most important variables since the Variable Importance Ranking and the model definition coincide;

• the final selected set should be $\{1,2,5,3,6,4,7\}$.

The behaviour of the set associated with the estimator $\tilde{s}$, when we apply the theoretical procedure, is summarized in Table 2.

At the intersection of the row $\beta$ and the column $\alpha$ appears the set of variables associated with $\tilde{s}(\alpha,\beta)$.

First, we notice that those results are the expected ones. Then, we see that for a fixed value of the parameter $\alpha$ (respectively $\beta$), increasing $\beta$ (resp. $\alpha$) decreases the size of the selected set, as expected. Moreover, this decrease is related to Breiman's Variable Importance since the explanatory variables disappear according to the Variable Importance Ranking (see Table 1). As the expected final set $\{1,2,5,3,6,4,7\}$ appears in Table 2, the final step of the procedure, whose results are given in Table 3, returns the "good" set.

Table 3 provides some further information. At present, we do not know how to choose the parameters $\alpha$ and $\beta$ of the penalty function. This is the reason why the theoretical procedure includes a final selection by test sample. But, if we were able to determine the value of those parameters from the data, this final step would disappear. If we analyse Table 3, we see that the "best" parameter $\alpha$ takes only one value and that $\beta$ belongs to a "small" range. So, those results lead to the conclusion that a data-driven determination of the parameters $\alpha$ and $\beta$ of the penalty function may be possible and that further investigations are needed.

β \ α             | α ≤ 0.05               | 0.05 < α ≤ 0.1  | 0.1 < α ≤ 2     | 2 < α ≤ 12  | 12 < α ≤ 60 | 60 < α
β ≤ 100           | {1,2,5,6,3,7,4,8,9,10} | {1,2,5,6,3,7,4} | {1,2,5,6,3,7,4} | {1,2,5,6,3} | {1,2,5}     | {1}
100 < β ≤ 700     | {1,2,5,6,3,7,4}        | {1,2,5,6,3,7,4} | {1,2,5,6,3}     | {1,2,5,6,3} | {1,2,5}     | {1}
700 < β ≤ 1300    | {1,2,5,6,3}            | {1,2,5,6,3}     | {1,2,5,6,3}     | {1,2,5,6,3} | {1,2,5}     | {1}
1300 < β ≤ 1700   | {1,2,5}                | {1,2,5}         | {1,2,5}         | {1,2,5}     | {1}         | {1}
1900 < β          | {1}                    | {1}             | {1}             | {1}         | {1}         | {1}

Table 2: Set of variables associated with the estimator $\tilde{s}$ for some values of the parameters $\alpha$ and $\beta$ which appear in the penalty function pen.

α     | β     | selected set
0.3   | → 100 | {1,2,3,4,5,6,7}

Table 3: Results of the final model selection.

As the theoretical procedure is validated on the simulated example, we now consider a more realistic procedure for the case where the number of explanatory variables is large. It involves a smaller family $\mathcal{P}^*$ of sets of variables. To determine this family, we use an idea introduced by Poggi and Tuleau (2006), which combines Forward Selection and variable importance (VI), and whose principle is the following. The sets of $\mathcal{P}^*$ are constructed by invoking and testing the explanatory variables according to Breiman's Variable Importance ranking. More precisely, the first set is composed of the most important variable according to VI. To construct the second one, we consider the two most important variables and we test whether the addition of the second most important variable has a significant incremental influence on the response variable. If the influence is significant, the second set of $\mathcal{P}^*$ is composed of the two most important variables. If not, we drop the second most important variable and we consider the first and the third most important variables, and so on. So, at each step, we add to the preceding set an explanatory variable which is less important than the preceding ones.
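
The construction just described can be sketched as a simple loop over the ranking. In the sketch below, `is_significant` is a hypothetical placeholder for the test of incremental influence, which is not specified in this section; the function name `build_family` is also an illustrative choice.

```python
def build_family(ranking, is_significant):
    """Build a nested family of variable sets from an importance ranking.

    ranking: variables ordered from the most to the least important.
    is_significant(current_set, candidate) -> bool: hypothetical test of
        whether adding `candidate` to `current_set` has a significant
        incremental influence on the response.
    """
    family = []
    current = []
    for variable in ranking:
        # The most important variable always forms the first set.
        if not current or is_significant(tuple(current), variable):
            current = current + [variable]
            family.append(tuple(current))
        # Otherwise the variable is dropped and the next one is tried.
    return family
```

By construction the family is nested and contains at most as many sets as there are variables in the ranking.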

For the simulated example, the corresponding family $\mathcal{P}^*$ is:

$$\mathcal{P}^* = \{\{1\};\ \{1,2\};\ \{1,2,5\};\ \{1,2,5,6\};\ \{1,2,5,6,3\};\ \{1,2,5,6,3,7\};\ \{1,2,5,6,3,7,4\}\}$$

In this family, the variables $X^8$, $X^9$ and $X^{10}$ do not appear. This is consistent with the model definition and Breiman's VI ranking.

The first advantage of this family $\mathcal{P}^*$ is that it involves at most $p$ sets of variables instead of $2^p$. The second one is that, if we perform our procedure restricted to the family $\mathcal{P}^*$, we obtain nearly the same results for the behaviour of the set associated with $\tilde{s}$ as those obtained with all the $2^p$ sets of variables (see Table 2). The only difference is that, since $\mathcal{P}^*$ does not contain the set of size 10, in Table 2 the set $\{1,2,3,4,5,6,7,8,9,10\}$ is replaced by $\{1,2,5,6,3,7,4\}$.



6. Appendix 

This section presents some lemmas which are useful in the proofs of the propositions of Sections 4 and 3. Lemmas 1 to 4 are known results; we just give the statements and references for the proofs. Lemma 5 is a variation of lemma 4. The remaining lemmas are intermediate results which we prove in order to obtain the propositions.

Lemma 1 is a concentration inequality due to Talagrand. This type of inequality describes how a random variable behaves around its expectation.

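Lemma 1 (Talagrand). Consider $n$ independent random variables $\xi_1, \ldots, \xi_n$ with values in some measurable space $\Theta$. Let $\mathcal{F}$ be some countable family of real-valued measurable functions on $\Theta$, such that $\|f\|_\infty \le b < \infty$ for every $f \in \mathcal{F}$.

Let $Z = \sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \left(f(\xi_i) - \mathbb{E}[f(\xi_i)]\right)\right|$ and $\sigma^2 = \sup_{f\in\mathcal{F}}\left\{\sum_{i=1}^n Var[f(\xi_i)]\right\}$.

Then, there exist two universal constants $K_1$ and $K_2$ such that for any positive real number $x$,

$$\mathbb{P}\left(Z \ge K_1\mathbb{E}[Z] + K_2\left[\sigma\sqrt{2x} + bx\right]\right) \le \exp(-x).$$

Proof. See Massart. ■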

Lemma 2 allows one to pass from local maximal inequalities to a global one.

Lemma 2 (Maximal inequality). Let $(\mathcal{S}, d)$ be some countable set.
Let $Z$ be some process indexed by $\mathcal{S}$ such that $\sup_{t\in B(u,\sigma)}|Z(t) - Z(u)|$ has finite expectation for any positive real $\sigma$, with $B(u,\sigma) = \{t \in \mathcal{S} \text{ such that } d(t,u) \le \sigma\}$.
Then, for all $\Phi : \mathbb{R} \to \mathbb{R}^+$ such that:

- $x \mapsto \frac{\Phi(x)}{x}$ is non increasing,

- $\forall \sigma \ge \sigma_* > 0$, $\mathbb{E}\left[\sup_{t\in B(u,\sigma)}|Z(t)-Z(u)|\right] \le \Phi(\sigma)$,

we have:

$$\forall x \ge \sigma_*,\quad \mathbb{E}\left[\sup_{t\in\mathcal{S}}\frac{|Z(t)-Z(u)|}{d^2(t,u)+x^2}\right] \le \frac{4}{x^2}\Phi(x).$$

Proof. See Massart and Nédélec (2006), section "Appendix: Maximal inequalities", lemma 5.5. ■



Thanks to lemma 3, we see that the Hold-Out is an adaptive selection procedure for classification.

Lemma 3 (Hold-Out). Assume that we observe $N + n$ independent random variables with common distribution $P$ depending on some parameter $s$ to be estimated. The first $N$ observations $X' = (X'_1, \ldots, X'_N)$ are used to build some preliminary collection of estimators $(\hat{s}_m)_{m\in\mathcal{M}}$ and we use the remaining observations $(X_1, \ldots, X_n)$ to select some estimator $\hat{s}_{\hat{m}}$ among the collection defined before by minimizing the empirical contrast.

Suppose that $\mathcal{M}$ is finite with cardinal $K$.
If there exists a function $w$ such that:

- $w : \mathbb{R}^+ \to \mathbb{R}^+$,

- $x \mapsto \frac{w(x)}{x}$ is non increasing,

- $\forall \epsilon > 0$, $\sup_{l(s,t)\le\epsilon^2} Var_P\left(\gamma(t,\cdot) - \gamma(s,\cdot)\right) \le w^2(\epsilon)$,

then, for all $\theta \in (0,1)$, one has:

$$(1-\theta)\,\mathbb{E}\left[l(s,\hat{s}_{\hat{m}})\ \big|\ X'\right] \le (1+\theta)\inf_{m\in\mathcal{M}} l(s,\hat{s}_m) + \delta_*^2\left(2\theta + (1+\log K)\left(\frac{1}{3}+\frac{1}{\theta}\right)\right)$$

where $\delta_*$ satisfies $\sqrt{n}\,\delta_*^2 = w(\delta_*)$.

Proof. See Massart (2003), Chapter "Statistical Learning", Section "Advanced model selection problems". ■



Lemmas 4 and 5 are concentration inequalities for a sum of squared random variables whose Laplace transforms are controlled. Lemma 4 is due to Sauvé (2009) and allows one to generalize the model selection result of Birgé and Massart (2007) for histogram models without assuming the observations to be Gaussian. In the first lemma, we consider only partitions $m$ of $\{1,\ldots,n\}$ constructed from an initial partition $m_0$ (i.e. for any element $J$ of $m$, $J$ is the union of elements of $m_0$), whereas in the second lemma we consider all partitions $m$ of $\{1,\ldots,n\}$.

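Lemma 4. Let $\varepsilon_1, \ldots, \varepsilon_n$ be $n$ independent and identically distributed random variables satisfying:

$$\mathbb{E}[\varepsilon_i] = 0 \quad\text{and for any } \lambda \in (-1/\rho, 1/\rho),\quad \log\mathbb{E}\left[e^{\lambda\varepsilon_i}\right] \le \frac{\sigma^2\lambda^2}{2(1-\rho|\lambda|)}.$$

Let $m_0$ be a partition of $\{1,\ldots,n\}$ such that, $\forall J \in m_0$, $|J| \ge N_{min}$.

We consider the collection $\mathcal{M}$ of all partitions of $\{1,\ldots,n\}$ constructed from $m_0$ and the statistics

$$\chi_m^2 = \sum_{J\in m}\frac{\left(\sum_{i\in J}\varepsilon_i\right)^2}{|J|},\quad m \in \mathcal{M}.$$

Let $\delta > 0$ and denote $\Omega_\delta = \left\{\forall J \in m_0;\ \left|\sum_{i\in J}\varepsilon_i\right| \le \delta\sigma^2|J|\right\}$.
Then for any $m \in \mathcal{M}$ and any $x > 0$,

$$\mathbb{P}\left(\chi_m^2\mathbb{1}_{\Omega_\delta} \ge \sigma^2|m| + 4\sigma^2(1+\rho\delta)\sqrt{2|m|x} + 2\sigma^2(1+\rho\delta)x\right) \le e^{-x}$$

and

$$\mathbb{P}\left(\Omega_\delta^c\right) \le 2\frac{n}{N_{min}}\exp\left(\frac{-\delta^2\sigma^2N_{min}}{2(1+\rho\delta)}\right).$$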
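Proof. Let $m \in \mathcal{M}$ and denote, for any $J \in m$,

$$Z_J = \frac{\left(\sum_{i\in J}\varepsilon_i\right)^2}{|J|} \wedge \left(\delta^2\sigma^4|J|\right).$$

$(Z_J)_{J\in m}$ are independent random variables. After calculating their moments, we deduce from Bernstein's inequality that, for any $x > 0$,

$$\mathbb{P}\left(\sum_{J\in m}Z_J \ge \sigma^2|m| + 4\sigma^2(1+\rho\delta)\sqrt{2|m|x} + 2\sigma^2(1+\rho\delta)x\right) \le e^{-x}.$$

As $\sum_{J\in m}Z_J = \chi_m^2$ on the set $\Omega_\delta$, we get that for any $x > 0$,

$$\mathbb{P}\left(\chi_m^2\mathbb{1}_{\Omega_\delta} \ge \sigma^2|m| + 4\sigma^2(1+\rho\delta)\sqrt{2|m|x} + 2\sigma^2(1+\rho\delta)x\right) \le e^{-x}.$$

Thanks to the assumption on the Laplace transform of the $\varepsilon_i$, we have for any $J \in m_0$

$$\mathbb{P}\left(\left|\sum_{i\in J}\varepsilon_i\right| \ge \delta\sigma^2|J|\right) \le 2\exp\left(\frac{-\delta^2\sigma^2|J|}{2(1+\rho\delta)}\right).$$

As $|J| \ge N_{min}$ and $m_0$ has at most $n/N_{min}$ elements, we obtain

$$\mathbb{P}\left(\Omega_\delta^c\right) \le 2\frac{n}{N_{min}}\exp\left(\frac{-\delta^2\sigma^2N_{min}}{2(1+\rho\delta)}\right). \quad ■$$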



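Lemma 5. Let $\varepsilon_1, \ldots, \varepsilon_n$ be $n$ independent and identically distributed random variables satisfying:

$$\mathbb{E}[\varepsilon_i] = 0 \quad\text{and for any } \lambda \in (-1/\rho, 1/\rho),\quad \log\mathbb{E}\left[e^{\lambda\varepsilon_i}\right] \le \frac{\sigma^2\lambda^2}{2(1-\rho|\lambda|)}.$$

We consider the collection $\mathcal{M}$ of all partitions of $\{1,\ldots,n\}$ and the statistics

$$\chi_m^2 = \sum_{J\in m}\frac{\left(\sum_{i\in J}\varepsilon_i\right)^2}{|J|},\quad m \in \mathcal{M}.$$

Let $\delta > 0$ and denote $\Omega_\delta = \left\{\forall 1 \le i \le n;\ |\varepsilon_i| \le \delta\sigma^2\right\}$.
Then for any $m \in \mathcal{M}$ and any $x > 0$,

$$\mathbb{P}\left(\chi_m^2\mathbb{1}_{\Omega_\delta} \ge \sigma^2|m| + 4\sigma^2(1+\rho\delta)\sqrt{2|m|x} + 2\sigma^2(1+\rho\delta)x\right) \le e^{-x}$$

and

$$\mathbb{P}\left(\Omega_\delta^c\right) \le 2n\exp\left(\frac{-\delta^2\sigma^2}{2(1+\rho\delta)}\right).$$

Proof. The proof is exactly the same as the preceding one. The only difference is that the set $\Omega_\delta$ is smaller and $N_{min} = 1$. ■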
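
The deviation bound of lemmas 4 and 5 can be sanity-checked by simulation. The sketch below is only an illustration under simplifying assumptions: the noise is Gaussian (which satisfies the Laplace transform condition for every $\rho > 0$), the threshold is taken in the limiting form $\sigma^2|m| + 4\sigma^2\sqrt{2|m|x} + 2\sigma^2x$ obtained with $\rho\delta = 0$, and the indicator of $\Omega_\delta$ is dropped. The function names are illustrative.

```python
import math
import random


def chi2_statistic(eps, partition):
    """Sum over the blocks J of (sum of eps over J)^2 / |J|."""
    return sum(sum(eps[i] for i in block) ** 2 / len(block) for block in partition)


def exceedance_frequency(n, partition, sigma, x, repetitions, seed=0):
    """Monte Carlo frequency of the event {chi2_m >= threshold(x)}."""
    rng = random.Random(seed)
    m = len(partition)
    # Limiting threshold (rho * delta = 0); an assumption of this sketch.
    threshold = sigma**2 * m + 4 * sigma**2 * math.sqrt(2 * m * x) + 2 * sigma**2 * x
    hits = 0
    for _ in range(repetitions):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        if chi2_statistic(eps, partition) >= threshold:
            hits += 1
    return hits / repetitions
```

With, for instance, six blocks of ten indices and $x = 1$, the observed frequency can be compared with $e^{-x}$; a simulation of this kind illustrates the inequality but of course does not prove it.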

Lemmas 6 and 7 give the expression of the weights needed in the model selection procedure.

Lemma 6. The weights $x_{M,T} = a|T| + b|M|\left(1+\log\left(\frac{p}{|M|}\right)\right)$, with $a > 2\log 2$ and $b > 1$ two absolute constants, satisfy

$$\sum_{M\in\mathcal{P}(\Lambda)}\ \sum_{T\preceq T_{max}^{(M)}} e^{-x_{M,T}} \le \Sigma(a,b) \qquad (6.1)$$

with $\Sigma(a,b) = -\log\left(1-e^{-(a-2\log 2)}\right)\frac{e^{-(b-1)}}{1-e^{-(b-1)}}$.
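Proof. We are looking for weights $x_{M,T}$ such that the sum

$$\Sigma(\mathcal{L}_1) = \sum_{M\in\mathcal{P}(\Lambda)}\ \sum_{T\preceq T_{max}^{(M)}} e^{-x_{M,T}}$$

is lower than an absolute constant.

Taking $x$ as a function of the number of variables $|M|$ and of the number of leaves $|T|$, we have

$$\Sigma(\mathcal{L}_1) = \sum_{k=1}^{p}\ \sum_{\substack{M\in\mathcal{P}(\Lambda)\\ |M|=k}}\ \sum_{D=1}^{n_1}\left|\left\{T\preceq T_{max}^{(M)};\ |T| = D\right\}\right| e^{-x(k,D)}.$$

Since

$$\left|\left\{T\preceq T_{max}^{(M)};\ |T| = D\right\}\right| \le \frac{1}{D}\binom{2(D-1)}{D-1} \le \frac{2^{2D}}{D},$$

we get

$$\Sigma(\mathcal{L}_1) \le \sum_{k=1}^{p}\left(\frac{ep}{k}\right)^k\sum_{D\ge 1}\frac{1}{D}e^{-(x(k,D)-(2\log 2)D)}.$$

Taking $x(k,D) = aD + bk\left(1+\log\left(\frac{p}{k}\right)\right)$ with $a > 2\log 2$ and $b > 1$ two absolute constants, we have

$$\Sigma(\mathcal{L}_1) \le \left(\sum_{k\ge 1}e^{-(b-1)k}\right)\left(\sum_{D\ge 1}\frac{1}{D}e^{-(a-(2\log 2))D}\right) = \Sigma(a,b).$$

Thus the weights $x_{M,T} = a|T| + b|M|\left(1+\log\left(\frac{p}{|M|}\right)\right)$, with $a > 2\log 2$ and $b > 1$ two absolute constants, satisfy (6.1). ■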
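
The bound (6.1) can be checked numerically for given values of $a$, $b$ and $p$. The sketch below assumes the combinatorial bounds used in the proof: at most $\binom{p}{k}$ sets of $k$ variables, and at most $\frac{1}{D}\binom{2(D-1)}{D-1}$ pruned subtrees with $D$ leaves. The number of leaves is truncated at `max_leaves`, and all names are illustrative.

```python
import math


def sigma_bound(a, b):
    """Constant Sigma(a, b) of lemma 6; requires a > 2 log 2 and b > 1."""
    tree_part = -math.log(1 - math.exp(-(a - 2 * math.log(2))))
    var_part = math.exp(-(b - 1)) / (1 - math.exp(-(b - 1)))
    return tree_part * var_part


def weighted_sum(a, b, p, max_leaves):
    """Upper bound on the sum of exp(-x_{M,T}) using the counting bounds of the proof."""
    total = 0.0
    for k in range(1, p + 1):
        var_weight = math.comb(p, k) * math.exp(-b * k * (1 + math.log(p / k)))
        for d in range(1, max_leaves + 1):
            # Catalan number: bound on the number of subtrees with d leaves.
            trees = math.comb(2 * (d - 1), d - 1) // d
            total += var_weight * trees * math.exp(-a * d)
    return total
```

Keeping `max_leaves` to a few hundred avoids overflow when the Catalan numbers are converted to floating point; the terms beyond that range are negligible when $a > 2\log 2$.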



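Lemma 7. The weights

$$x_{M,T} = \left(a + (|M|+1)\left(1+\log\left(\frac{n_1}{|M|+1}\right)\right)\right)|T| + b\left(1+\log\left(\frac{p}{|M|}\right)\right)|M|$$

with $a > 0$ and $b > 1$ two absolute constants, satisfy

$$\sum_{M\in\mathcal{P}(\Lambda)}\ \sum_{T\in\mathcal{M}_{n_1,M}} e^{-x_{M,T}} \le \Sigma'(a,b) \qquad (6.2)$$

with $\Sigma'(a,b) = \frac{e^{-a}}{1-e^{-a}}\,\frac{e^{-(b-1)}}{1-e^{-(b-1)}}$ and $\mathcal{M}_{n_1,M}$ the set of trees built on the grid $\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}$ with splits on the variables in $M$.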

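Proof. We are looking for weights $x_{M,T}$ such that the sum

$$\Sigma\left(\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) = \sum_{M\in\mathcal{P}(\Lambda)}\ \sum_{T\in\mathcal{M}_{n_1,M}} e^{-x_{M,T}}$$

is lower than an absolute constant.

Taking $x$ as a function of the number of variables $|M|$ and the number of leaves $|T|$, we have

$$\Sigma\left(\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) = \sum_{k=1}^{p}\ \sum_{\substack{M\in\mathcal{P}(\Lambda)\\ |M|=k}}\ \sum_{D=1}^{n_1}\left|\left\{T\in\mathcal{M}_{n_1,M};\ |T| = D\right\}\right| e^{-x(k,D)}.$$

Since the Vapnik-Chervonenkis dimension of $Sp_M$ (the class of admissible splits which involve only the variables of $M$) is $|M|+1$, it follows from lemma 2 in Gey and Nédélec (2005) that

$$\left|\left\{T\in\mathcal{M}_{n_1,M};\ |T| = D\right\}\right| \le \left(\frac{en_1}{|M|+1}\right)^{D(|M|+1)}.$$

We get

$$\Sigma\left(\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) \le \sum_{k=1}^{p}\binom{p}{k}\sum_{D\ge 1}e^{D\left[(k+1)\left(1+\log\left(\frac{n_1}{k+1}\right)\right)\right]-x(k,D)} \qquad (6.3)$$

$$\le \sum_{k=1}^{p}\left(\frac{ep}{k}\right)^k\sum_{D\ge 1}e^{D\left[(k+1)\left(1+\log\left(\frac{n_1}{k+1}\right)\right)\right]-x(k,D)} \qquad (6.4)$$

Taking $x(k,D) = D\left[a + (k+1)\left(1+\log\left(\frac{n_1}{k+1}\right)\right)\right] + x'(k)$ with $a > 0$ an absolute constant, we have

$$\Sigma\left(\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) \le \left(\sum_{k\ge 1}e^{-\left(x'(k)-k\left(1+\log\left(\frac{p}{k}\right)\right)\right)}\right)\left(\sum_{D\ge 1}e^{-aD}\right).$$

Thus taking $x(k,D) = D\left[a + (k+1)\left(1+\log\left(\frac{n_1}{k+1}\right)\right)\right] + bk\left(1+\log\left(\frac{p}{k}\right)\right)$ with $a > 0$ and $b > 1$ two absolute constants, we have

$$\Sigma\left(\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) \le \left(\sum_{k\ge 1}e^{-(b-1)k}\right)\left(\sum_{D\ge 1}e^{-aD}\right) = \Sigma'(a,b).$$

Thus the weights $x(M,T) = |T|\left[a + (|M|+1)\left(1+\log\left(\frac{n_1}{|M|+1}\right)\right)\right] + b|M|\left(1+\log\left(\frac{p}{|M|}\right)\right)$ with $a > 0$ and $b > 1$ two absolute constants, satisfy (6.2). ■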



The last two lemmas provide controls in expectation for processes studied in classification.

Lemma 8. Let $(X_1,Y_1), \ldots, (X_n,Y_n)$ be $n$ independent observations taking their values in some measurable space $\Theta\times\{0,1\}$, with common distribution $P$. We denote by $d$ the $L^2(\mu)$ distance where $\mu$ is the marginal distribution of $X_i$.

Let $S_T$ be the set of piecewise constant functions defined on the partition $\widetilde{T}$ associated to the leaves of the tree $T$.

Let us suppose that:

$$\exists h > 0,\ \forall x \in \Theta,\ |2\eta(x)-1| \ge h \quad\text{with } \eta(x) = \mathbb{P}(Y=1|X=x).$$

Then:

(i) $\sup_{u\in S_T,\ l(s,u)\le\epsilon^2} d(s,u) \le w(\epsilon)$ with $w(x) = \frac{1}{\sqrt{h}}x$,

(ii) $\exists \phi_T : \mathbb{R}^+ \to \mathbb{R}^+$ such that:

• $\phi_T(0) = 0$,

• $x \mapsto \frac{\phi_T(x)}{x}$ is non increasing,

• $\forall \sigma \ge w(\sigma_T)$, $\sqrt{n}\,\mathbb{E}\left[\sup_{u\in S_T,\ d(u,v)\le\sigma}|\bar{\gamma}_n(u)-\bar{\gamma}_n(v)|\right] \le \phi_T(\sigma)$,

with $\sigma_T$ the positive solution of $\phi_T(w(x)) = \sqrt{n}x^2$.

(iii) $\sigma_T^2 \le \frac{K_3^2|T|}{nh}$.
Proof. The first point (i) is easy to obtain from the following expression of $l$:

$$l(s,u) = \mathbb{E}\left(|s(X)-u(X)|\,|2\eta(X)-1|\right).$$

The existence of the function $\phi_T$ has been proved by Massart and Nédélec (2006). They also give an upper bound of $\sigma_T^2$ based on Sauer's lemma. The upper bound of $\sigma_T^2$ given here is better than the one of Massart and Nédélec (2006) because it has been adapted to the structure of $S_T$. ■

Thanks to lemmas 2 and 8, we deduce the next one.

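Lemma 9. Let $(X_1,Y_1), \ldots, (X_n,Y_n)$ be a sample taking its values in some measurable space $\Theta\times\{0,1\}$, with common distribution $P$. Let $T$ be a tree, $S_T$ the associated space, $h$ the margin and $K_3$ the universal constant which appear in lemma 8. If $2x \ge \frac{K_3\sqrt{|T|}}{h\sqrt{n}}$, then:

$$\mathbb{E}\left[\sup_{u\in S_T}\frac{|\bar{\gamma}_n(u)-\bar{\gamma}_n(v)|}{d^2(u,v)+(2x)^2}\right] \le \frac{2K_3\sqrt{|T|}}{x\sqrt{n}}.$$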

7. Proofs 



7.1. Classification 

7.1.1. Proof of Proposition 1

Let $M \in \mathcal{P}(\Lambda)$, $T \preceq T_{max}^{(M)}$ and $s_{M,T} \in S_{M,T}$. We let

• $w_{M',T'}(u) = \left(d(s,s_{M,T}) + d(s,u)\right)^2 + y_{M',T'}^2$,

• $V_{M',T'} = \sup_{u\in S_{M',T'}}\frac{|\bar{\gamma}_{n_2}(u)-\bar{\gamma}_{n_2}(s_{M,T})|}{w_{M',T'}(u)}$,

where $y_{M',T'}$ is a parameter that will be chosen later.
Following the proof of theorem 4.2 in Massart (2000), we get

$$l(s,\tilde{s}) \le l(s,s_{M,T}) + w_{\hat{M},\hat{T}}(\tilde{s})\times V_{\hat{M},\hat{T}} + pen(M,T) - pen(\hat{M},\hat{T}) \qquad (7.1)$$



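To control $V_{\hat{M},\hat{T}}$, we check a uniform overestimation of $V_{M',T'}$. To do this, we apply Talagrand's concentration inequality, written in lemma 1, to $V_{M',T'}$. So we obtain that for any $(M',T')$, and for any $x > 0$,

$$\mathbb{P}\left(V_{M',T'} \ge K_1\mathbb{E}[V_{M',T'}] + K_2\left(\sqrt{\frac{x}{2n_2}}\,y_{M',T'}^{-1} + \frac{x}{n_2}\,y_{M',T'}^{-2}\right)\ \Big|\ \mathcal{L}_1 \text{ and } \{X_i;\ (X_i,Y_i)\in\mathcal{L}_2\}\right) \le e^{-x}$$

where $K_1$ and $K_2$ are universal positive constants.

Setting $x = x_{M',T'} + \xi$ with $\xi > 0$ and the weights $x_{M',T'} = a|T'| + b|M'|\left(1+\log\left(\frac{p}{|M'|}\right)\right)$, as defined in lemma 6, and summing all those inequalities with respect to $(M',T')$, we derive a set $\Omega_{\xi,(M,T)}$ such that:

• $\mathbb{P}\left(\Omega_{\xi,(M,T)}^c\ \big|\ \mathcal{L}_1 \text{ and } \{X_i;\ (X_i,Y_i)\in\mathcal{L}_2\}\right) \le e^{-\xi}\Sigma(a,b)$

• on $\Omega_{\xi,(M,T)}$, $\forall (M',T')$,

$$V_{M',T'} \le K_1\mathbb{E}[V_{M',T'}] + K_2\left(\sqrt{\frac{x_{M',T'}+\xi}{2n_2}}\,y_{M',T'}^{-1} + \frac{x_{M',T'}+\xi}{n_2}\,y_{M',T'}^{-2}\right) \qquad (7.2)$$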
Now we overestimate $\mathbb{E}[V_{M',T'}]$.

Let $u_{M',T'} \in S_{M',T'}$ such that $d(s,u_{M',T'}) = \inf_{u\in S_{M',T'}} d(s,u)$.
Then



E[V m >,t'] <E 



\jm{UM' J') ~ Jm(SMj)\ 



inf (w M ',t'(u)) 

LIES m' J' 



+ E 


sup 







\y ni (u) - y ni (uM'j')V 

W M ',T'(u) , 



We prove: 



E 



\y ni (uM>,T>) - y ni (s M j)\ 



inf (w M ',T'(u)) 

ueS M , r 



\JniyM\T' 



For the second term, we have 

'\fn 2 {u)-y~n 2 {UM',T')\ 



E 



sup 

iteS M i ji 



< 4E 



W M ',T'{u) 

By application of the lemma[9]for 2yw,T' ^ ~/$fjT' we deduce 

\y~n 2 {u) ~ y„ 2 {UM',T')\ 



\ym(u)-y n AuM',T')\ ^ 

sup I — - 

ueS M , r \d {u, Um',T') + Q-ywj'Y } 



E 



sup 

W£ S M'.T' 



W M ',T'(u) 

Thus from ( 17.2b . we know that on ty^Mj) and V(M', T') 
A', 



Vm> t> < 



^niyw 



— (8K 3 + l) + ^2 [ a/ '"V ' C ,v>r /■ : 



2«? 



For y M ,, T , = 3* ( 8 tf 3 A^H + l) + *2 + _^ ^S£2f) 



With ^ ^ 48fHi> We § et: 



W,7" < t; 



By overestimating vvrT^S), y 2 — and replacing all of those results in ( 17. U . we get 



M,T 



Kh 



Kh 



1 / (s, s) < 1 + — \l (s, s m ,t) - pen(M, T) + pen(M, T) 



+ 18AT 



+ 18AT 



64/Ti 2 /: : 2 



"2 



"2 



1 

+ 



<2K 2 P 
— !- + 2K 2 — 

n 2 "2 



K 2 1 
+ 



H 2 ^Jk 

\ 2 \ 



2\ 



IV 2 a/31J 

Taking a penalty pen(M, T) which balances all the terms in (m, i.e. 



Welettf= 2^4 withCi > 1 

n Ci - 1 1 



pen(M, T) 



36(Ci + 1) 



fc(Ci - 1) 



64K 2 K 2 xmt 
3 \T\+2K T J 



"2 



"2 



^2 Cl-1 
+ 



2 V6(Ci + l) 



2\ 



We obtain that on O, 



f,(Af,r) 



/(i, S) < chs, s m ,t) + pen(M, T)\ + —J 
\ I n 2 h 

Integrating with respect to £ and by minimizing , we get 

< Ci iafil(s,S MT ) + pen{M,T)\ + -^-E(a,fc) 
\ m,t{ ) mh 



E 
In brief, with a penalty function such that
$\forall M \in \mathcal{P}(\Lambda)$, $\forall T \preceq T_{max}^{(M)}$,



pen(M, T) 



rt2h ri2h \ 



+ log 



P_ 

\M\ 



36(d + 1) 



Ci-1 

36(Ci + 1) 
C, -1 



6AK 2 X K 2 + 2aK 2 



h(d - 1) 



2^2 



ft(C, ~ 1) 



2 \6(Ci + l) 



'6(Ci + 1) 



|M| 



we have: 



E 



< Ci inf(z(i,S M r) + peniM,T)\ + — I(a,fe) 
M,r I J « 2 /j 



1 + log — 
1 |M| 



We notice that the two constants $\alpha_0$ and $\beta_0$, which appear in Proposition 1, are defined by



ao = 36 



64^/T 3 2 +41og2/T 2 



#2 

2 + V6j 



□ 



7.1.2. Proof of Proposition 2

Let $M, M' \in \mathcal{P}(\Lambda)$, $T \preceq T_{max}^{(M)}$, $T' \preceq T_{max}^{(M')}$ and $s_{M,T} \in S_{M,T}$. We let

• $w_{(M',T'),(M,T)}(u) = \left(d(s,s_{M,T}) + d(s,u)\right)^2 + \left(y_{M',T'} + y_{M,T}\right)^2$

• $V_{(M',T'),(M,T)} = \sup_{u\in S_{M',T'}}\frac{|\bar{\gamma}_{n_1}(u)-\bar{\gamma}_{n_1}(s_{M,T})|}{w_{(M',T'),(M,T)}(u)}$

where $y_{M',T'}$ and $y_{M,T}$ are parameters that will be chosen later.
Following the proof of theorem 4.2 in Massart (2000), we get

$$l(s,\tilde{s}) \le l(s,s_{M,T}) + w_{(\hat{M},\hat{T}),(M,T)}(\tilde{s})\times V_{(\hat{M},\hat{T}),(M,T)} + pen(M,T) - pen(\hat{M},\hat{T}) \qquad (7.3)$$

To control $V_{(\hat{M},\hat{T}),(M,T)}$, we check a uniform overestimation of $V_{(M',T'),(M,T)}$. To do this, we apply Talagrand's concentration inequality, written in lemma 1, to $V_{(M',T'),(M,T)}$, for $(M',T') \in \mathcal{P}(\Lambda)\times\mathcal{M}_{n_1,M'}$ and $(M,T) \in \mathcal{P}(\Lambda)\times\mathcal{M}_{n_1,M}$. So we obtain that for any $(M',M) \in \mathcal{P}(\Lambda)^2$, any $T' \in \mathcal{M}_{n_1,M'}$, any $T \in \mathcal{M}_{n_1,M}$ and any $x > 0$,



V(M' ,T'),(M,T) - KiK[V(M>,T'),(M,T)] + K2 



,,-2 



«2 



where K\ and Tf? are universal positive constants. 

Setting x = x m >t> + x m ,t + £ with £ > and the weights x^r- = (a + i\M'\ + 1) (l + 'og(jj^n-))) l^'l + 
fc|M'| (l + log ( p^Tf))-. as defined in lemma [7] and summing all those inequalities with respect to (M', T') 
and (M, T), we derive a set such that: 

• on %, V(M', T'), (M, T) 



< TTiE [V(M' ,T'),(M,T)] + K2 



XM'.T' + Xm,T , ,_i 

~ iywj' +yu,T) 

In 1 



„ (x M >,T> + Xm,T , ,_ 2 \ 

+7^2 CyM'.r- + y*f,r) 

\ "1 / 



(7.4) 
Now we overestimate $\mathbb{E}\left[V_{(M',T'),(M,T)}\right]$.

Let $u_{M',T'} \in S_{M',T'}$ such that $d(s,u_{M',T'}) = \inf_{u\in S_{M',T'}} d(s,u)$.

Then

\V ni (uM',T>) ~ "fn^SMj)] 



E [V(M-,ro,(M,r)] < E 



inf (W( M -,r'),(M,r)(«)) 

U€S M r : T' 



We prove: 



E 



lr«i("M',r) -r«,(*M,r)l 



inf (w( M ',r'),(Af,r)(«)) 

UES m iti 



+ E 


sup ( 




U£S M ',T' \ 




1 



lr«,( M ) - r«,("M',r')l 



For the second term, we have 

'\fnM)-ynS u M',T')\ 



E 



sup 



< 4E 


sup 




_ueS M ',T' \ 



ueS M , r \ W(M>J'),(M,T){u) 

By application of lemma[9]for 2y M \T' > 

'lr«i(") -JtiMm'.T' 



VnT(>M',7-' + jM,r) 

lyni(»)-yni(«M',r')l 



E 



sup 



y[n\{yM',T' + yMj) 



ueS M > x > \ W(M'J'),(M,T){u) 

Thus from O, we know that on % and V(M', 7"), (M, 7/) 

#1 /„„ rr^-r .\ „ f jx M ',r +x M ,T + J. 



V(M',T'),(M,T) 



^1 lojr r^r7\ i\ v ( l x M',T> + Xm.T +£, x _] ) 

(8X3 VF'I + 1) + K 2 \ ~ CvM'.r +yM,r) + 

M'.r + Ym.t) v y I V 2«i J 



„ / Xm'J' + *MJ +£, ,_ 2 
#2 CyAf',7' + yM.T) 



For y M ,, T = 3K ( 8 * 3 VP*[ + l) + *2 V^f^ + vfe #^^) 
with ^ ^ 48ib' we § et: 

1 

V(M',r'),(M,7") ^ ^ 

By overestimating w { ^j^ (mt)(^)> yj^j ar, d replacing all of those results in (17.3b . we get 



+36K 



+36K 



3 |f I + —\T\ + 2K 2 



IK- 



«i 

XM,T 

ni 



«i 



IKo 1 
+ 

2 



2 + V3tf 



\ 2 ^ 







'\K\ 




+ 36/: 











+ 2Ki — 



\ 2 \ 

2 + VSTJ, 



Welettf= |§^}withCi > 1. 

Taking a penalty pen(M, T) which balances all the terms in (m, r), 

72(Ci 4 1) 



A(Ci - 1) 



"1 



|7/| 4 2^2 
xm.t



Ci-1 
6(Ci 4 1) 



2\ 



We obtain that on D, f 



l(s, 3) < ichs, s MJ ) + pen(M, T)\ + — j + 

V / n\h n\h 

In brief, with a penalty function such that VM e !P(A), forallT < T { J^ X 



pen(M, T) 



ff £l( 1 + (|M| + 1) ( 1+tos (_^_ 



72(C, + 1) 



Ci -1 



64/^2 + 2^ 



Ci-1 
6(Ci + 1) 



2\ 



x [a + QM\ + 1) 1 +log 



|M| + 1 



72(C 1 + 1) 
+ C,-l 2 * 2 



C, - 1 



V 2 ^6(Ci + 1) 



l + log U 

mh\ 8 \\m\ 



we have 



l(s, s) < 2cAl(s, s m ,t) + pen(M, T)} + (1 + 
y ' n\h 

We notice that the two constants $\alpha_0$ and $\beta_0$, which appear in Proposition 2, are defined by



ff = 72 



(AK\kI + 2K 2 



IK2 J_ 
2 + V6 



and y8 = 72 x 2 x K 2 



IK 2 1 



□ 



7.1.3. Proof of Proposition 3

This result is obtained by a direct application of lemma 3, which appears in Section 6. □

7.2. Regression 

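7.2.1. Proof of Proposition 4

Let $a > 2\log 2$, $b > 1$, $\theta \in (0,1)$ and $K > 2-\theta$ be four constants.
Let us denote

$$s_{M,T} = \underset{u\in S_{M,T}}{argmin}\ \|s-u\|_{n_2}^2 \quad\text{and}\quad \varepsilon_{M,T} = \underset{u\in S_{M,T}}{argmin}\ \|\varepsilon-u\|_{n_2}^2.$$

Following the proof of theorem 1 in Birgé and Massart (2007), we get

$$(1-\theta)\|s-\tilde{s}\|_{n_2}^2 = \Delta_{\hat{M},\hat{T}} + \inf_{(M,T)} R_{M,T} \qquad (7.5)$$

where

$$\Delta_{M,T} = (2-\theta)\|\varepsilon_{M,T}\|_{n_2}^2 - 2\langle\varepsilon, s-s_{M,T}\rangle_{n_2} - \theta\|s-s_{M,T}\|_{n_2}^2 - pen(M,T)$$

$$R_{M,T} = \|s-s_{M,T}\|_{n_2}^2 - \|\varepsilon_{M,T}\|_{n_2}^2 + 2\langle\varepsilon, s-s_{M,T}\rangle_{n_2} + pen(M,T)$$

We first control $\Delta_{\hat{M},\hat{T}}$ by using concentration inequalities for $\|\varepsilon_{M,T}\|_{n_2}^2$ and $-\langle\varepsilon, s-s_{M,T}\rangle_{n_2}$.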



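For any $M$, we denote

$$\Omega_M = \left\{\forall t \in \widetilde{T}_{max}^{(M)},\ \left|\sum_{X_i\in t}\varepsilon_i\right| \le \frac{\sigma^2}{\rho}\,|X_i\in t|\right\}.$$

Thanks to lemma 4, we get that for any $(M,T)$ and any $x > 0$,

$$\mathbb{P}\left(\|\varepsilon_{M,T}\|_{n_2}^2\mathbb{1}_{\Omega_M} \ge \frac{\sigma^2}{n_2}|T| + 8\frac{\sigma^2}{n_2}\sqrt{2|T|x} + 4\frac{\sigma^2}{n_2}x\ \Big|\ \mathcal{L}_1 \text{ and } \{X_i;\ (X_i,Y_i)\in\mathcal{L}_2\}\right) \le e^{-x} \qquad (7.6)$$

and

$$\mathbb{P}\left(\Omega_M^c\ \big|\ \mathcal{L}_1 \text{ and } \{X_i;\ (X_i,Y_i)\in\mathcal{L}_2\}\right) \le 2\frac{n_2}{N_{min}}\exp\left(\frac{-\sigma^2N_{min}}{4\rho^2}\right).$$

Denoting $\Omega = \bigcap_M \Omega_M$, we have

$$\mathbb{P}\left(\Omega^c\ \big|\ \mathcal{L}_1 \text{ and } \{X_i;\ (X_i,Y_i)\in\mathcal{L}_2\}\right) \le 2^{p+1}\frac{n_2}{N_{min}}\exp\left(\frac{-\sigma^2N_{min}}{4\rho^2}\right).$$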

To control — < e,s— sm.t >n 2 < we calculate its Laplace transform. Thanks to assumption ( 12.3b and \\s\\a, < 
R, we have for any (M, T) and any A e (0; ^J, 

A 2 cr 2 \\s - s M r\\i 



-A<e,s-s mt >„ 



LCi and {X t ; (X h Y t ) e £ 2 ) 



2" 2 (1-^) 



Thus, for any (M, T) and any x > 0, 

P(- < s, s - s MJ >,„ > -?—\\s - s MJ \\ m y[Tx + ^-x |Xi and {X; (X,-, T,) e £ 2 }) 

< e~ x (7.7) 

Setting x = i^j + ^ with £ > and the weights x^r = a|3H + b\M\ (\ + log(j^j)) as defined in lemma|6l 
and summing all inequalities ( 17.6b and ( 17.71 ) with respect to (M, T), we derive a set Eg such that 

• P(££ |Xi and {X,-; (X, T,) € £ 2 }) < 2e" f E(a,Z>) 

• on the set £ f f] Q, for any (M, T), 

2 2 2 

A M ,r < (2 - 0) — \T\ + 8(2 - 6)— y/2\T\(x M , T + f) + 4(2 - &)— (x MJ + # 

«2 «2 «2 



+2— —||s - s MJ \\ m ^2(x MJ + £) + 4 — (x MJ + £) 

V"2 "2 



-All* - Sm,t\L ~ PeHM, T) 



where fc) = - log (l - e -(«- 21 °g 2 )) j^ry. 

Using the two following inequalities 

2^=11* - s^rlk yj'Kxuj+O < 6\\s - s MJ \\i + \^_(x M j + f), 
2tJ\T\(x m ,t+0 < rj\T\ + r]- l (x M , T + £ 

with 77 = 4^/2 > °' we derive 1,131 on tne set E t H f° r any (M, T), 

cr 2 2 / 2 \ cr 2 

A M , r < (2 - 0)—\T\ + 8 V5(2 - 0)— ^\T\(x MJ + £> + 4(2 - 0) + - + —(x M j + © - pe«(M, T) 

«2 «2 \ 9 cr l ] « 2 

cr 2 / / 8(2 - 0) \ 2 p \ cr 2 

< K —\T\ + 4(2 - 0) 1 + \ ; + - + 44« — (x MJ +0- pen(M, T) 



n 2 \ \ K + 6-2) 6 cr 2 ) n 2 
Taking a penalty pen(M, T) which compensates for all the other terms in (M, T), i.e.



pen(M,T) > K—\T\ + 

«2 



cr 

«2 



we get that, on the set 



A^Ho<|4(2-g)|i+ 8(2 " e) ) + 1 + 4A;r)-{ 

M ' T 1 V 1 K + 6-2 9 a- 2 J « 2 



Integrating with respect to we derive 



E 



A^HnlXil < 2 (4(2 - 0) (l + + | + 4^) gsfe ft) 



(7.8) 



(7.9) 



We are going now to control E inf Rmt^-ci Xi 

(MX) ' I 

In the same way we deduced ( 17.71 ) from assumption ( 12.3l l. we get that for any (M, T) and any x > 

P( < s, s - 5M.7- > ni > ~^=\\s - SuAm + — * ki and {*,•; (X if F,) € £ 2 }) 

V V"2 "2 1 ; 



Thus we derive a set such that 

• p(f£ UCi and {X*; (X,-, T,) e £ 2 }) < <rf£(a,ft) 



• on the set Ff , for any (M, T), 



V«2 



It follows from definition of Rmj that on the set Fg, for any (M, T), 

4-pR 



9 U y *Tfjl\ 

Rmj = < \\s - s M j\\ m + 2— —\\s - s MJ \\ m ^2{x MJ + if) + (x M j + £) + pen(M, T) 

V"2 "2 



< 2||a - s MJ \\l + (2 + 4-^) — (xm,t +0 + pen(M, T) 

< 2||* - sh.t\L + 2pen(M, T) + {t. + A^-r) — £ 



And 



E 



inf R M .T^n\-C\ 

(MJ) I 



< 2 inf |E 

(Af, 



nf (l 



l-S - SM,rlU-£l 
.2 



+ pen(M, T) 



+ \\ +4^-r\— Ha,b) 
(T 2 } « 2 



We conclude from dT9b and d77T0b that 



(1 - 0)1 



k-Sll'Unki 



2 inf {E^-^rll^lx, 



• pen(M, 7) 



+ ( 8 (2-0)(l + ^|) + | + 12^)gs(a ) ft) 
(7.10)



It remains to control 

E 



\s-S\&Vt*U 



s-s\\ 2 n M£i 



„ 2 u-n'\-Li 

= E 

< E 

< R 2 



k-^niUftkii + E 



\s m \\ 2 M£i 



\sf n2 Ho\£i + J] E \\s MT mf n2 n&\£ 

M 



fE|||e Mi7 .(M,||4 L& 



'(n c |-Ci) 



As 



E 



m '- n 



< C 2 (p,cr) 

where C(p, cr) is a constant which depends only on p and <x. 
Thus we have 



E 



\s-s\\lM£i 



< R 2 



Let us recall that 



For /> < log n 2 and iV mi „ > log n 2 , 
• 2pJf(q4£i) < -g= J= 



It follows that 



log n 2 



E 



< C'(p,cr,R) 



1 



« 2 (log n 2 ) 3/2 



Finally, we have the following result: 

Denoting by T = [4(2 - 0) (l + f 2 ^) + §] and taking a penalty which satisfies V M e !P(A) V T < T, 







pen(M,T) > {(K + aT) cr 2 + 4apfl) ^ + (bYcr 2 + 4Z?pfl) ^ |l + log jj 
if p < log«2 and iV m ;„ > log n 2 , we have, 

(1 - 0)1 - S|| 2 , |Xil < 2 inf I inf \\s - M || 2 + pen{M, T)\ 

- 1 (M,T) {ueS M .T ^ ) 

+ (2T + 2 + 12-^fl) — Z(a, b) 
\ cr 1 I m 



+(l~0)C'(p,cr,R)- 



1 



« 2 (log h 2 ) 3/2 

We deduce the proposition by taking $K \to 2$, $\theta \to 1$, $a \to 2\log 2$ and $b \to 1$. □
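7.2.2. Proof of Proposition 5

Let $a > 0$, $b > 1$, $\theta \in (0,1)$ and $K > 2-\theta$ be four constants.

To follow the preceding proof, we have to consider the "deterministic" bigger collection of models:

$$\left\{S_{M,T};\ T \in \mathcal{M}_{n_1,M} \text{ and } M \in \mathcal{P}(\Lambda)\right\}$$

where $\mathcal{M}_{n_1,M}$ denotes the set of trees built on the grid $\{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}$ with splits on the variables in $M$. By considering this bigger collection of models, we no longer have partitions built from an initial one. So, we use lemma 5 instead of lemma 4.
Let us denote, for any $M \in \mathcal{P}(\Lambda)$ and any $T \in \mathcal{M}_{n_1,M}$,

$$s_{M,T} = \underset{u\in S_{M,T}}{argmin}\ \|s-u\|_{n_1}^2 \quad\text{and}\quad \varepsilon_{M,T} = \underset{u\in S_{M,T}}{argmin}\ \|\varepsilon-u\|_{n_1}^2.$$

Following the proof of theorem 1 in Birgé and Massart (2007), we get

$$(1-\theta)\|s-\tilde{s}\|_{n_1}^2 = \Delta_{\hat{M},\hat{T}} + \inf_{(M,T)} R_{M,T} \qquad (7.11)$$

where

$$\Delta_{M,T} = (2-\theta)\|\varepsilon_{M,T}\|_{n_1}^2 - 2\langle\varepsilon, s-s_{M,T}\rangle_{n_1} - \theta\|s-s_{M,T}\|_{n_1}^2 - pen(M,T)$$

$$R_{M,T} = \|s-s_{M,T}\|_{n_1}^2 - \|\varepsilon_{M,T}\|_{n_1}^2 + 2\langle\varepsilon, s-s_{M,T}\rangle_{n_1} + pen(M,T)$$

We first control $\Delta_{\hat{M},\hat{T}}$. Let us denote

• $\Omega = \left\{\forall 1 \le i \le n_1,\ |\varepsilon_i| \le \delta\sigma^2\right\}$

Thanks to lemma 5, we get that for any $M \in \mathcal{P}(\Lambda)$, $T \in \mathcal{M}_{n_1,M}$ and any $x > 0$,

$$\mathbb{P}\left(\|\varepsilon_{M,T}\|_{n_1}^2\mathbb{1}_{\Omega} \ge \frac{\sigma^2}{n_1}|T| + 4(1+\rho\delta)\frac{\sigma^2}{n_1}\sqrt{2|T|x} + 2(1+\rho\delta)\frac{\sigma^2}{n_1}x\ \Big|\ \{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) \le e^{-x} \qquad (7.12)$$

and

$$\mathbb{P}\left(\Omega^c\ \big|\ \{X_i;\ (X_i,Y_i)\in\mathcal{L}_1\}\right) \le 2n_1\exp\left(\frac{-\delta^2\sigma^2}{2(1+\rho\delta)}\right).$$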

Thanks to assumption (12.3b . like to the (Ml) case, we get that for any M e P(A),T e At nil M and any 
x > 0, 



J (- < e, s - s MJ >,„ > -^—\\s - smjU V2l + ^-x \{Xf, (X it F,) e £1}) 



(7.13) 



Setting * = with£ > and the weights x MJ = (a + (|M| + 1) (l + log (j^))) l r l +b(l + log (^)) |M| 

as defined in lemma [7] and summing all inequalities ( 17.12b and (17.13b with respect to M e !P(A) and 
T e vH 71j m, we derive a set E f such that 

. P(£| \{X,; (X h F) €£!})< 2e-^(a,fo) 
• on the set $E_\xi \cap \Omega$, for any $(M,T)$,

2 2 2 

A M ,r < (2 - 0)—\T\ + 4(1 +p5)(2 - 0) — ^2\T\(x MJ + J) + 2(1 + p5)(2 - 0)—(x m ,t + f) 

o~ i pR 

+2—=\\s - s MJ \\ m V2(x M ,r +0+ 4— (x m ,t + O 
V«i "l 



-All* - s m ,t\\1 -pen(M,T) 
where fc) = T £ ^ r ^JV-d • 
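Using the two following inequalities

$$2\frac{\sigma}{\sqrt{n_1}}\|s - s_{M,T}\|_{n_1}\sqrt{2(x_{M,T}+\xi)} \le \theta\|s - s_{M,T}\|_{n_1}^2 + \frac{2}{\theta}\frac{\sigma^2}{n_1}(x_{M,T}+\xi),$$

$$2\sqrt{|T|(x_{M,T}+\xi)} \le \eta|T| + \eta^{-1}(x_{M,T}+\xi)$$

with $\eta = \dfrac{K+\theta-2}{2\sqrt{2}(1+\rho\delta)(2-\theta)} > 0$, we derive that on the set $E_\xi \cap \Omega$, for any $(M, T)$,

$$\begin{aligned}
A_{M,T} &\le (2-\theta)\frac{\sigma^2}{n_1}|T| + 4\sqrt{2}(1+\rho\delta)(2-\theta)\frac{\sigma^2}{n_1}\sqrt{|T|(x_{M,T}+\xi)} + \left(2(1+\rho\delta)(2-\theta) + \frac{2}{\theta} + \frac{4\rho R}{\sigma^2}\right)\frac{\sigma^2}{n_1}(x_{M,T}+\xi) - \mathrm{pen}(M,T) \\
&\le K\frac{\sigma^2}{n_1}|T| + \left[2(1+\rho\delta)(2-\theta)\left(1 + \frac{4(1+\rho\delta)(2-\theta)}{K+\theta-2}\right) + \frac{2}{\theta} + \frac{4\rho R}{\sigma^2}\right]\frac{\sigma^2}{n_1}(x_{M,T}+\xi) - \mathrm{pen}(M,T)
\end{aligned}$$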
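The value of $\eta$ used in this step can be checked by hand. The derivation below is a sketch only: it assumes the elementary bound $2\sqrt{uv} \le \eta u + \eta^{-1} v$ for $u, v \ge 0$, and it uses the shorthand $c = (1+\rho\delta)(2-\theta)$ and $X = x_{M,T} + \xi$, which are introduced here purely for illustration and are not notation from the paper.

```latex
% Illustrative shorthand (not from the paper):
%   c = (1+\rho\delta)(2-\theta),   X = x_{M,T} + \xi
\begin{align*}
4\sqrt{2}\,c\,\sqrt{|T|X}
  &= 2\sqrt{2}\,c \cdot 2\sqrt{|T|X}
   \le 2\sqrt{2}\,c\,\bigl(\eta\,|T| + \eta^{-1} X\bigr). \\
\intertext{Requiring the total coefficient of $|T|$ to equal $K$:}
(2-\theta) + 2\sqrt{2}\,c\,\eta = K
  &\iff \eta = \frac{K+\theta-2}{2\sqrt{2}\,c} > 0
  \quad (\text{since } K > 2-\theta). \\
\intertext{The coefficient of $X$ contributed by the first and second terms is then}
2c + \frac{2\sqrt{2}\,c}{\eta}
  &= 2c + \frac{8c^2}{K+\theta-2}
   = 2c\left(1 + \frac{4c}{K+\theta-2}\right).
\end{align*}
```

With this choice the bound on $A_{M,T}$ has exactly $K\sigma^2/n_1$ in front of $|T|$, which is what the penalty is later asked to dominate.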

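Taking a penalty $\mathrm{pen}(M, T)$ which compensates for all the other terms in $(M, T)$, i.e.

$$\mathrm{pen}(M,T) \ge K\frac{\sigma^2}{n_1}|T| + \left[2(1+\rho\delta)(2-\theta)\left(1 + \frac{4(1+\rho\delta)(2-\theta)}{K+\theta-2}\right) + \frac{2}{\theta} + \frac{4\rho R}{\sigma^2}\right]\frac{\sigma^2}{n_1}x_{M,T},$$

we get that, on the set $E_\xi \cap \Omega$,

$$A_{\widehat{M,T}} \le \left[2(1+\rho\delta)(2-\theta)\left(1 + \frac{4(1+\rho\delta)(2-\theta)}{K+\theta-2}\right) + \frac{2}{\theta} + \frac{4\rho R}{\sigma^2}\right]\frac{\sigma^2}{n_1}\xi \tag{7.14}$$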



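We are going now to control $\inf_{(M,T)} R_{M,T}$.

In the same way we deduced (7.13) from assumption (2.3), we get that for any $M \in \mathcal{P}(\Lambda)$, $T \in \mathcal{M}_{n_1,M}$ and any $x > 0$,

$$P\left(\langle \varepsilon, s - s_{M,T}\rangle_{n_1} \ge \frac{\sigma}{\sqrt{n_1}}\|s - s_{M,T}\|_{n_1}\sqrt{2x} + \frac{2\rho R}{n_1}x \;\Big|\; \{X_i;\ (X_i, Y_i) \in \mathcal{L}_1\}\right) \le e^{-x}.$$

Thus we derive a set $F_\xi$ such that

• $P\left(F_\xi^c \;\big|\; \{X_i;\ (X_i, Y_i) \in \mathcal{L}_1\}\right) \le e^{-\xi}\,\Sigma(a,b)$

• on the set $F_\xi$, for any $(M, T)$,

$$\langle \varepsilon, s - s_{M,T}\rangle_{n_1} \le \frac{\sigma}{\sqrt{n_1}}\|s - s_{M,T}\|_{n_1}\sqrt{2(x_{M,T}+\xi)} + \frac{2\rho R}{n_1}(x_{M,T}+\xi)$$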
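It follows from the definition of $R_{M,T}$ that on the set $F_\xi$, for any $(M, T)$,

$$\begin{aligned}
R_{M,T} &\le \|s - s_{M,T}\|_{n_1}^2 + 2\frac{\sigma}{\sqrt{n_1}}\|s - s_{M,T}\|_{n_1}\sqrt{2(x_{M,T}+\xi)} + \frac{4\rho R}{n_1}(x_{M,T}+\xi) + \mathrm{pen}(M,T) \\
&\le 2\|s - s_{M,T}\|_{n_1}^2 + \left(2 + \frac{4\rho R}{\sigma^2}\right)\frac{\sigma^2}{n_1}(x_{M,T}+\xi) + \mathrm{pen}(M,T) \\
&\le 2\|s - s_{M,T}\|_{n_1}^2 + 2\,\mathrm{pen}(M,T) + \left(2 + \frac{4\rho R}{\sigma^2}\right)\frac{\sigma^2}{n_1}\xi
\end{aligned}$$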
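The passage from the first to the second line of this bound on $R_{M,T}$ can be spelled out as follows. This is a sketch under the assumption that only the inequality $2uv \le u^2 + v^2$ is used; the shorthand $X = x_{M,T} + \xi$ is introduced here for readability and is not the paper's notation.

```latex
% Illustrative shorthand (not from the paper): X = x_{M,T} + \xi
\begin{align*}
2\,\frac{\sigma}{\sqrt{n_1}}\,\|s - s_{M,T}\|_{n_1}\sqrt{2X}
  &\le \|s - s_{M,T}\|_{n_1}^2 + \frac{2\sigma^2}{n_1}\,X
  && (2uv \le u^2 + v^2), \\
\frac{2\sigma^2}{n_1}\,X + \frac{4\rho R}{n_1}\,X
  &= \left(2 + \frac{4\rho R}{\sigma^2}\right)\frac{\sigma^2}{n_1}\,X .
\end{align*}
```

The last line of the bound then only requires the penalty to dominate $\left(2 + 4\rho R/\sigma^2\right)\frac{\sigma^2}{n_1}x_{M,T}$, so that the $x_{M,T}$ part of $X$ can be absorbed into a second copy of $\mathrm{pen}(M,T)$.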



We conclude that on n Fg n Q 



(1 - - s\t < 2MJ\\s - s M , T \t + pen(M, T)} + Y— £ 



(M,T) 



"1 



And, for p < log{n\), 



H 2exp 

ni 



' 7?-(l°gni) 2 + ^f(logrn)(loglogrn) + Alogn^ 



2(l + 5 -£lo gni ) 



Finally, we have the following result: 
Denoting by 

T = 2(1 +pS)(2- 0)11 + 



2(1 + p6)(2-0)\ 2 



K + 6-2 



[2(l+5^/og|^j)(2-0) 



' 4(1+5 ^i gi-iy 2 -ey 



K + 0-2 



and 



e(ni) = 2e;t/? 



-^-(Jogni) 1 + ^-(logmXloglogm) + Mogn x 



2(1 + 



Taking a penalty which satisfies: V(M, T) V M e !P(A) V T <T, 



iM) 



pen(M,T) > K—\T\ 



\a + (\M\ + 1) 1 +/og 



+ ( T+4 ^ s )C i, ( 1+tos (iSi)) 1 " 1 



"1 



|M| + 1 



\t\ 



we have V£ > 0, with probability > 1 - 3e ^S(a, fo) - ^-e(«i) 

(1 - e)\\s - sf ni < 2 inf - s MJ \l + pen(M, T)} + (r + 2 + 8^) — £ 
We deduce the proposition by noticing that 
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and by taking K -2, — > 1, a — > and b — > 1. 



□ 



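7.2.3. Proof of the proposition

It follows from the definition of $\tilde{\tilde{s}}$ that for any $\tilde{s}(\alpha,\beta) \in \mathcal{G}$,

$$\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le \|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + 2\langle \varepsilon, \tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\rangle_{n_3} \tag{7.15}$$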



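Denoting $M_{\alpha,\beta,\alpha',\beta'} = \max\{|\tilde{s}(\alpha',\beta')(X_i) - \tilde{s}(\alpha,\beta)(X_i)|;\ (X_i, Y_i) \in \mathcal{L}_3\}$, we deduce from assumption (2.3) that for any $\tilde{s}(\alpha,\beta)$ and $\tilde{s}(\alpha',\beta') \in \mathcal{G}$,

$$\log E\left[\exp\left(\lambda\langle \varepsilon, \tilde{s}(\alpha',\beta') - \tilde{s}(\alpha,\beta)\rangle_{n_3}\right) \;\Big|\; \mathcal{L}_1, \mathcal{L}_2 \text{ and } \{X_i;\ (X_i, Y_i) \in \mathcal{L}_3\}\right] \le \frac{\sigma^2\|\tilde{s}(\alpha',\beta') - \tilde{s}(\alpha,\beta)\|_{n_3}^2\,\lambda^2}{2 n_3\left(1 - \frac{\rho}{n_3}M_{\alpha,\beta,\alpha',\beta'}|\lambda|\right)} \quad\text{if } |\lambda| < \frac{n_3}{\rho\, M_{\alpha,\beta,\alpha',\beta'}}.$$

Thus we get that for any $\tilde{s}(\alpha,\beta)$, $\tilde{s}(\alpha',\beta') \in \mathcal{G}$ and $x > 0$,

$$P\left(\langle \varepsilon, \tilde{s}(\alpha',\beta') - \tilde{s}(\alpha,\beta)\rangle_{n_3} \ge \frac{\sigma}{\sqrt{n_3}}\|\tilde{s}(\alpha',\beta') - \tilde{s}(\alpha,\beta)\|_{n_3}\sqrt{2x} + M_{\alpha,\beta,\alpha',\beta'}\frac{\rho}{n_3}x \;\Big|\; \mathcal{L}_1, \mathcal{L}_2 \text{ and } \{X_i;\ (X_i, Y_i) \in \mathcal{L}_3\}\right) \le e^{-x}$$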

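Setting $x = 2\log\mathcal{K} + \xi$ with $\xi > 0$, and summing all these inequalities with respect to $\tilde{s}(\alpha,\beta)$ and $\tilde{s}(\alpha',\beta') \in \mathcal{G}$, we derive a set $E_\xi$ such that

• $P\left(E_\xi^c \;\big|\; \mathcal{L}_1, \mathcal{L}_2 \text{ and } \{X_i;\ (X_i, Y_i) \in \mathcal{L}_3\}\right) \le e^{-\xi}$

• on the set $E_\xi$, for any $\tilde{s}(\alpha,\beta)$ and $\tilde{s}(\alpha',\beta') \in \mathcal{G}$,

$$\langle \varepsilon, \tilde{s}(\alpha',\beta') - \tilde{s}(\alpha,\beta)\rangle_{n_3} \le \frac{\sigma}{\sqrt{n_3}}\|\tilde{s}(\alpha',\beta') - \tilde{s}(\alpha,\beta)\|_{n_3}\sqrt{2(2\log\mathcal{K} + \xi)} + M_{\alpha,\beta,\alpha',\beta'}\frac{\rho}{n_3}(2\log\mathcal{K} + \xi)$$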
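The choice $x = 2\log\mathcal{K} + \xi$ is what makes the union bound over pairs of candidate estimators collapse to $e^{-\xi}$. The computation below is a sketch that assumes $\mathcal{K}$ denotes the number of elements of $\mathcal{G}$, so that there are at most $\mathcal{K}^2$ ordered pairs; that reading of $\mathcal{K}$ is an assumption, since its definition lies outside this section.

```latex
% Assumption: \mathcal{K} = |\mathcal{G}|, hence at most \mathcal{K}^2 ordered pairs.
\begin{align*}
P\bigl(E_\xi^c \mid \cdot\bigr)
  &\le \sum_{\tilde{s}(\alpha,\beta),\,\tilde{s}(\alpha',\beta') \in \mathcal{G}}
       e^{-(2\log\mathcal{K} + \xi)} \\
  &\le \mathcal{K}^2 \cdot \mathcal{K}^{-2}\, e^{-\xi}
   = e^{-\xi}.
\end{align*}
```

The price of this uniformity is the additive $2\log\mathcal{K}$ that appears in every later bound next to $\xi$.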

It remains to control $M_{\alpha,\beta,\alpha',\beta'}$ in the two situations (M1) and (M2) (except if $\rho = 0$).
In the (Ml) situation, we consider the set 









Qi = n ■ 




z •> 


MeP(A) 




Xj€t 



<R\{i; {Xj, Yf) G X2 and Xj G f}| 

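Thanks to assumption (2.3), we get that for any $\lambda \in (-1/\rho, 1/\rho)$,

$$\log E\left[\exp\left(\lambda\sum_{(X_i,Y_i)\in\mathcal{L}_2,\ X_i \in t}\varepsilon_i\right) \;\Big|\; \mathcal{L}_1 \text{ and } \{X_i;\ (X_i, Y_i) \in \mathcal{L}_2\}\right] \le \frac{\sigma^2\lambda^2}{2(1-\rho|\lambda|)}\,\bigl|\{i;\ (X_i, Y_i) \in \mathcal{L}_2 \text{ and } X_i \in t\}\bigr|$$

It follows that for any $x > 0$,

$$P\left(\Bigl|\sum_{X_i \in t}\varepsilon_i\Bigr| \ge x \;\Big|\; \mathcal{L}_1 \text{ and } \{X_i;\ (X_i, Y_i) \in \mathcal{L}_2\}\right) \le 2\exp\left(-\frac{x^2}{2\bigl(\sigma^2\,|\{i;\ (X_i, Y_i) \in \mathcal{L}_2 \text{ and } X_i \in t\}| + \rho x\bigr)}\right)$$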
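The step from the bound on the log-Laplace transform to a tail bound is the usual Chernoff argument. The sketch below assumes $\rho > 0$ and uses the shorthand $S = \sum_{X_i \in t}\varepsilon_i$, $N = |\{i;\ (X_i, Y_i) \in \mathcal{L}_2 \text{ and } X_i \in t\}|$ and $v = \sigma^2 N$, all introduced here for illustration; the conditioning is left implicit.

```latex
% Illustrative shorthand (not from the paper): S = \sum_{X_i \in t} \varepsilon_i,  v = \sigma^2 N.
\begin{align*}
P(S \ge x)
  &\le \exp\!\left(-\lambda x + \frac{v\lambda^2}{2(1-\rho\lambda)}\right)
  && \text{for any } 0 < \lambda < 1/\rho . \\
\intertext{Choosing $\lambda = x/(v + \rho x) \in (0, 1/\rho)$ gives $1 - \rho\lambda = v/(v+\rho x)$, so}
-\lambda x + \frac{v\lambda^2}{2(1-\rho\lambda)}
  &= -\frac{x^2}{v+\rho x} + \frac{x^2}{2(v+\rho x)}
   = -\frac{x^2}{2(v+\rho x)} . \\
\intertext{Applying the same bound to $-S$ and taking $x = RN$:}
P(|S| \ge RN)
  &\le 2\exp\!\left(-\frac{R^2 N}{2(\sigma^2 + \rho R)}\right).
\end{align*}
```

Since the exponent is decreasing in $N$, a lower bound $N \ge N_{min}$ on the leaf size turns this into a bound that no longer depends on the particular node $t$.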
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Taking x = R (X,-, Yf) e .£2 and X,- e f}| and summing all these inequalities, we get that 

-R 2 N mh , \ 



• (fljl £1 and {X,-; T,) e £ 2 }) < 2" +1 exp ( 



2{a 2 +pR), 



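On the set $\Omega_1$, as for any $(M, T)$, $\|\hat{s}_{M,T}\|_\infty \le 2R$, we have $M_{\alpha,\beta,\alpha',\beta'} \le 4R$.
Thus, on the set $\Omega_1 \cap E_\xi$, for any $\tilde{s}(\alpha,\beta) \in \mathcal{G}$,

$$\langle \varepsilon, \tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\rangle_{n_3} \le \frac{\sigma}{\sqrt{n_3}}\|\tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\|_{n_3}\sqrt{2(2\log\mathcal{K} + \xi)} + 4R\frac{\rho}{n_3}(2\log\mathcal{K} + \xi)$$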



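It follows from (7.15) that, on the set $\Omega_1 \cap E_\xi$, for any $\tilde{s}(\alpha,\beta) \in \mathcal{G}$ and any $\eta \in (0, 1)$,

$$\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le \|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + (1-\eta)\|\tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + \frac{2}{1-\eta}\frac{\sigma^2}{n_3}(2\log\mathcal{K} + \xi) + \frac{8\rho R}{n_3}(2\log\mathcal{K} + \xi)$$

and

$$\eta^2\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le (1 + \eta^{-1} - \eta)\|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + \left(\frac{2}{1-\eta}\sigma^2 + 8\rho R\right)\frac{(2\log\mathcal{K} + \xi)}{n_3}$$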
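The second inequality follows from the first by bounding the cross term with the triangle inequality. The sketch below uses the shorthand $a = \|s - \tilde{\tilde{s}}\|_{n_3}$, $b = \|s - \tilde{s}(\alpha,\beta)\|_{n_3}$, $u = \|\tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\|_{n_3}$ and $C$ for the sum of the two remainder terms; these letters are introduced here for illustration only.

```latex
% Illustrative shorthand (not from the paper): a, b, u are the three norms, C the remainder.
\begin{align*}
u^2 \le (a + b)^2
  &\le (1+\eta)\,a^2 + (1+\eta^{-1})\,b^2
  && (2ab \le \eta a^2 + \eta^{-1} b^2), \\
a^2 \le b^2 + (1-\eta)\,u^2 + C
  &\le b^2 + (1-\eta^2)\,a^2 + (1-\eta)(1+\eta^{-1})\,b^2 + C, \\
\eta^2 a^2
  &\le \bigl(1 + \eta^{-1} - \eta\bigr)\,b^2 + C ,
\end{align*}
% using (1-\eta)(1+\eta^{-1}) = \eta^{-1} - \eta.
```

Dividing by $\eta^2$ and taking the infimum over $\tilde{s}(\alpha,\beta) \in \mathcal{G}$ gives the form of the final oracle inequality.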
Taking p < log «2 and iV m ;„ > 4 ' T log 772, we have 



«3 



2(cr 2 +pR) n l 2 ~ io s 2 



Finally, in the (Ml) situation, we have 

for any f > 0, with probability > 1 - e"? - "riix, V77 6 (0, 1), 



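$$\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le \frac{(1 + \eta^{-1} - \eta)}{\eta^2}\inf_{\tilde{s}(\alpha,\beta) \in \mathcal{G}}\|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + \frac{1}{\eta^2}\left(\frac{2}{1-\eta}\sigma^2 + 8\rho R\right)\frac{(2\log\mathcal{K} + \xi)}{n_3}$$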

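In the (M2) situation, we consider the set

$$\Omega_2 = \{\forall\, 1 \le i \le n_1,\ |\varepsilon_i| \le 3\rho\log n_1\}.$$

Thanks to assumption (2.3), we get that

$$P\left(\Omega_2^c \;\big|\; \{X_i;\ (X_i, Y_i) \in \mathcal{L}_1\}\right) \le \epsilon(n_1)$$

with $\epsilon(n_1) = 2 n_1 \exp\left(-\dfrac{9\rho^2\log^2 n_1}{2(\sigma^2 + 3\rho^2\log n_1)}\right) \longrightarrow 0$ as $n_1 \to +\infty$.

On the set $\Omega_2$, as for any $(M, T)$, $\|\hat{s}_{M,T}\|_\infty \le R + 3\rho\log n_1$, we have $M_{\alpha,\beta,\alpha',\beta'} \le 2(R + 3\rho\log n_1)$.
Thus, on the set $\Omega_2 \cap E_\xi$, for any $\tilde{s}(\alpha,\beta) \in \mathcal{G}$,

$$\langle \varepsilon, \tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\rangle_{n_3} \le \frac{\sigma}{\sqrt{n_3}}\|\tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\|_{n_3}\sqrt{2(2\log\mathcal{K} + \xi)} + 2(R + 3\rho\log n_1)\frac{\rho}{n_3}(2\log\mathcal{K} + \xi)$$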
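Why the probability of leaving $\Omega_2$ vanishes can be seen from a short computation. The sketch below assumes $\rho > 0$ and that $\epsilon(n_1)$ has the Bernstein-type form $2 n_1 \exp\bigl(-x^2 / (2(\sigma^2 + \rho x))\bigr)$ evaluated at the threshold $x = 3\rho\log n_1$; the shorthand $L = \log n_1$ is introduced here for illustration.

```latex
% Assumption: \epsilon(n_1) = 2 n_1 \exp(-x^2 / (2(\sigma^2 + \rho x))) with x = 3\rho \log n_1.
% Illustrative shorthand: L = \log n_1.
\begin{align*}
\frac{9\rho^2 L^2}{2(\sigma^2 + 3\rho^2 L)}
  &= \frac{3L}{2}\left(1 - \frac{\sigma^2}{\sigma^2 + 3\rho^2 L}\right)
   \ge \frac{3L}{2} - \frac{\sigma^2}{2\rho^2}, \\
\epsilon(n_1)
  &\le 2\,n_1 \exp\!\left(-\frac{3}{2}\log n_1 + \frac{\sigma^2}{2\rho^2}\right)
   = 2\,e^{\sigma^2/(2\rho^2)}\, n_1^{-1/2}
   \xrightarrow[n_1 \to +\infty]{} 0 .
\end{align*}
```

Under this reading, the constant 3 in the threshold is what makes the exponent beat the factor $n_1$ coming from the union over the $n_1$ observations.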

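It follows from (7.15) that, on the set $\Omega_2 \cap E_\xi$, for any $\tilde{s}(\alpha,\beta) \in \mathcal{G}$ and any $\eta \in (0, 1)$,

$$\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le \|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + (1-\eta)\|\tilde{\tilde{s}} - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + \frac{2}{1-\eta}\frac{\sigma^2}{n_3}(2\log\mathcal{K} + \xi) + \frac{4\rho(R + 3\rho\log n_1)}{n_3}(2\log\mathcal{K} + \xi)$$

and

$$\eta^2\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le (1 + \eta^{-1} - \eta)\|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + \left(\frac{2}{1-\eta}\sigma^2 + 4\rho(R + 3\rho\log n_1)\right)\frac{(2\log\mathcal{K} + \xi)}{n_3}$$

Finally, in the (M2) situation, we have for any $\xi > 0$, with probability $\ge 1 - e^{-\xi} - \epsilon(n_1)$, $\forall \eta \in (0, 1)$,

$$\|s - \tilde{\tilde{s}}\|_{n_3}^2 \le \frac{(1 + \eta^{-1} - \eta)}{\eta^2}\inf_{\tilde{s}(\alpha,\beta) \in \mathcal{G}}\|s - \tilde{s}(\alpha,\beta)\|_{n_3}^2 + \frac{1}{\eta^2}\left(\frac{2}{1-\eta}\sigma^2 + 4\rho R + 12\rho^2\log n_1\right)\frac{(2\log\mathcal{K} + \xi)}{n_3}$$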



□ 